Computer Vision and Pattern Recognition 115
☆ Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Understanding historical and cultural artifacts demands human expertise and
advanced computational techniques, yet the process remains complex and
time-intensive. While large multimodal models offer promising support, their
evaluation and improvement require a standardized benchmark. To address this,
we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning
266 distinct cultures across 10 major historical regions. Designed for
AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological
discoveries, TimeTravel provides a structured dataset and robust evaluation
framework to assess AI models' capabilities in classification, interpretation,
and historical comprehension. By integrating AI with historical research,
TimeTravel fosters AI-powered tools for historians, archaeologists,
researchers, and cultural tourists to extract valuable insights while ensuring
technology contributes meaningfully to historical discovery and cultural
heritage preservation. We evaluate contemporary AI models on TimeTravel,
highlighting their strengths and identifying areas for improvement. Our goal is
to establish AI as a reliable partner in preserving cultural heritage, ensuring
that technological advancements contribute meaningfully to historical
discovery. Our code is available at:
\url{https://github.com/mbzuai-oryx/TimeTravel}.
comment: 4 pages, 6 figures
☆ Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework
Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei
Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning
capabilities by integrating external knowledge. However, existing benchmarks
primarily focus on simple image-text interactions, overlooking complex visual
formats like charts that are prevalent in real-world applications. In this
work, we introduce a novel task, Chart-based MRAG, to address this limitation.
To semi-automatically generate high-quality evaluation samples, we propose
CHARt-based document question-answering GEneration (CHARGE), a framework that
produces evaluation data through structured keypoint extraction, cross-modal
verification, and keypoint-based generation. By combining CHARGE with expert
validation, we construct Chart-MRAG Bench, a comprehensive benchmark for
chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8
domains from real-world documents. Our evaluation reveals three critical
limitations in current approaches: (1) unified multimodal embedding retrieval
methods struggle in chart-based scenarios, (2) even with ground-truth
retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87%
Coverage scores, and (3) MLLMs demonstrate consistent text-over-visual modality
bias during Chart-based MRAG reasoning. The CHARGE and Chart-MRAG Bench are
released at https://github.com/Nomothings/CHARGE.git.
☆ Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
Reasoning about images with rich text, such as charts and documents, is a
critical application of vision-language models (VLMs). However, VLMs often
struggle in these domains due to the scarcity of diverse text-rich
vision-language data. To address this challenge, we present CoSyn, a framework
that leverages the coding capabilities of text-only large language models
(LLMs) to automatically create synthetic text-rich multimodal data. Given input
text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts
an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic
images. With the underlying code as textual representations of the synthetic
images, CoSyn can generate high-quality instruction-tuning data, again relying
on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K
images and 2.7M rows of vision-language instruction-tuning data. Comprehensive
experiments on seven benchmarks demonstrate that models trained on our
synthetic data achieve state-of-the-art performance among competitive
open-source models, including Llama 3.2, and surpass proprietary models such as
GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing
data, enabling VLMs to ground information within input images, showcasing its
potential for developing multimodal agents capable of acting in real-world
environments.
comment: 20 pages, 19 figures, 9 tables, website:
https://yueyang1996.github.io/cosyn/
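A minimal sketch of the code-guided loop described above, assuming matplotlib as the renderer: a hard-coded plotting snippet stands in for code that a text-only LLM would emit, the code is executed to render a synthetic image, and a question-answer pair is derived from the same code. The field names and the prompt-free setup are illustrative assumptions, not the released CoSyn pipeline.

    # Minimal sketch of code-guided synthetic data generation.
    # Assumption: the plotting code below stands in for code an LLM would emit.
    import json
    import matplotlib
    matplotlib.use("Agg")  # render off-screen
    import matplotlib.pyplot as plt

    # 1) Code an LLM might produce for the domain "monthly sales bar chart".
    llm_generated_code = """
    import matplotlib.pyplot as plt
    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 95, 140, 110]
    plt.bar(months, sales)
    plt.title("Monthly sales (units)")
    plt.savefig("synthetic_chart.png", dpi=150)
    """

    # 2) Execute the code to render the synthetic image.
    exec(llm_generated_code)

    # 3) Because the code fully specifies the image, instruction-tuning pairs
    #    can be derived from it (in CoSyn this is again delegated to an LLM).
    instruction_sample = {
        "image": "synthetic_chart.png",
        "question": "Which month has the highest sales?",
        "answer": "Mar, with 140 units.",
    }
    print(json.dumps(instruction_sample, indent=2))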
☆ Dynamic Concepts Personalization from Single Videos
Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Personalizing generative text-to-image models has seen remarkable progress,
but extending this personalization to text-to-video models presents unique
challenges. Unlike static concepts, personalizing text-to-video models has the
potential to capture dynamic concepts, i.e., entities defined not only by their
appearance but also by their motion. In this paper, we introduce
Set-and-Sequence, a novel framework for personalizing Diffusion Transformers
(DiTs)-based generative video models with dynamic concepts. Our approach
imposes a spatio-temporal weight space within an architecture that does not
explicitly separate spatial and temporal features. This is achieved in two key
stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an
unordered set of frames from the video to learn an identity LoRA basis that
represents the appearance, free from temporal interference. In the second
stage, with the identity LoRAs frozen, we augment their coefficients with
Motion Residuals and fine-tune them on the full video sequence, capturing
motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal
weight space that effectively embeds dynamic concepts into the video model's
output domain, enabling unprecedented editability and compositionality while
setting a new benchmark for personalizing dynamic concepts.
comment: Webpage: https://snap-research.github.io/dynamic_concepts/
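A toy sketch of the two-stage idea on a single linear layer, assuming a plain LoRA parameterization (W + BA) and modeling the motion residuals as an extra low-rank term trained only in the second stage; the actual DiT integration, losses, and data handling are omitted, and all names are illustrative.

    # Toy sketch of the two-stage Set-and-Sequence recipe on one linear layer.
    import torch
    import torch.nn as nn

    class TwoStageLoRALinear(nn.Module):
        def __init__(self, dim, rank=4):
            super().__init__()
            self.base = nn.Linear(dim, dim)          # frozen pretrained weight
            self.base.weight.requires_grad_(False)
            self.base.bias.requires_grad_(False)
            # Stage 1: identity LoRA basis, learned on an unordered frame set.
            self.id_A = nn.Parameter(torch.randn(rank, dim) * 0.01)
            self.id_B = nn.Parameter(torch.zeros(dim, rank))
            # Stage 2: motion residual term, learned on the full video sequence.
            self.motion_A = nn.Parameter(torch.randn(rank, dim) * 0.01)
            self.motion_B = nn.Parameter(torch.zeros(dim, rank))

        def forward(self, x):
            delta = (self.id_B @ self.id_A) + (self.motion_B @ self.motion_A)
            return self.base(x) + x @ delta.t()

    layer = TwoStageLoRALinear(dim=64)
    # Stage 1: train only the identity LoRA (appearance, no temporal order).
    opt_stage1 = torch.optim.AdamW([layer.id_A, layer.id_B], lr=1e-4)
    # Stage 2: freeze identity LoRA, train only the motion residual.
    opt_stage2 = torch.optim.AdamW([layer.motion_A, layer.motion_B], lr=1e-4)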
☆ LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
Existing Large Vision-Language Models (LVLMs) can process inputs with context
lengths up to 128k visual and text tokens, yet they struggle to generate
coherent outputs beyond 1,000 words. We find that the primary limitation is the
absence of long output examples during supervised fine-tuning (SFT). To tackle
this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158
examples, each with multiple input images, an instruction, and corresponding
outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that
maintain high fidelity to the input images, we apply Direct Preference
Optimization (DPO) to the SFT model. Given the high cost of collecting human
feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which
breaks long outputs into segments and uses iterative corrections to form
preference pairs with the original outputs. Additionally, we develop
MMLongBench-Write, a benchmark featuring six tasks to evaluate the
long-generation capabilities of VLMs. Our 7B parameter model, trained with
LongWriter-V-22k and IterDPO, achieves impressive performance on this
benchmark, outperforming larger proprietary models like GPT-4o. Code and data:
https://github.com/THU-KEG/LongWriter-V
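A rough sketch of how IterDPO-style preference pairs might be assembled from segmented long outputs; splitting on blank lines and the correct_segment stub are illustrative assumptions rather than the released pipeline.

    # Sketch of segment-level preference-pair construction in the spirit of IterDPO.
    def split_into_segments(text, max_chars=800):
        """Greedily pack paragraphs into segments of roughly max_chars characters."""
        segments, current = [], ""
        for para in text.split("\n\n"):
            if len(current) + len(para) > max_chars and current:
                segments.append(current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            segments.append(current.strip())
        return segments

    def correct_segment(segment):
        # Placeholder: in practice feedback is applied to one short segment,
        # which is far cheaper than judging a 3,000-word output as a whole.
        return segment.replace("  ", " ")

    def build_preference_pairs(model_output):
        pairs = []
        for seg in split_into_segments(model_output):
            corrected = correct_segment(seg)
            if corrected != seg:
                # "chosen" is the corrected segment, "rejected" the original one.
                pairs.append({"chosen": corrected, "rejected": seg})
        return pairs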
☆ Improving the Diffusability of Autoencoders
Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin
Latent diffusion models have emerged as the leading approach for generating
high-quality images and videos, utilizing compressed latent representations to
reduce the computational burden of the diffusion process. While recent
advancements have primarily focused on scaling diffusion backbones and
improving autoencoder reconstruction quality, the interaction between these
components has received comparatively less attention. In this work, we perform
a spectral analysis of modern autoencoders and identify inordinate
high-frequency components in their latent spaces, which are especially
pronounced in the autoencoders with a large bottleneck channel size. We
hypothesize that this high-frequency component interferes with the
coarse-to-fine nature of the diffusion synthesis process and hinders the
generation quality. To mitigate the issue, we propose scale equivariance: a
simple regularization strategy that aligns latent and RGB spaces across
frequencies by enforcing scale equivariance in the decoder. It requires minimal
code changes and only up to 20K autoencoder fine-tuning steps, yet
significantly improves generation quality, reducing FID by 19% for image
generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation
on Kinetics-700 17x256x256.
comment: 26 pages, 22 figures, 9 tables
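One plausible reading of the scale-equivariance regularizer, sketched in PyTorch: decoding a downsampled latent should match the downsampled decode of the full latent. The scale factor, interpolation mode, and loss form are assumptions, not the paper's exact formulation.

    # Sketch of a scale-equivariance regularizer for an autoencoder decoder.
    import torch
    import torch.nn.functional as F

    def scale_equivariance_loss(decoder, z, factor=0.5):
        full = decoder(z)                                   # B x 3 x H x W
        z_small = F.interpolate(z, scale_factor=factor, mode="bilinear",
                                align_corners=False)
        pred_small = decoder(z_small)                       # decode low-freq latent
        target_small = F.interpolate(full, scale_factor=factor, mode="bilinear",
                                     align_corners=False)
        return F.l1_loss(pred_small, target_small.detach())

    # total_loss = reconstruction_loss + lambda_se * scale_equivariance_loss(dec, z)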
☆ Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison
Visual Question Answering (VQA) has emerged as a pivotal task in the
intersection of computer vision and natural language processing, requiring
models to understand and reason about visual content in response to natural
language questions. Analyzing VQA datasets is essential for developing robust
models that can handle the complexities of multimodal reasoning. Several
approaches have been developed to examine these datasets, each offering
distinct perspectives on question diversity, answer distribution, and
visual-textual correlations. Despite significant progress, existing VQA models
face challenges related to dataset bias, limited model complexity, commonsense
reasoning gaps, rigid evaluation methods, and generalization to real-world
scenarios. This paper presents a comprehensive comparative study of five
advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling,
BLIP-2, and OFA, each employing distinct methodologies to address these
challenges.
comment: 8 pages, No figures
☆ FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis
Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub
Foundation models are becoming increasingly effective in the medical domain,
offering pre-trained models on large datasets that can be readily adapted for
downstream tasks. Despite progress, fetal ultrasound images remain a
challenging domain for foundation models due to their inherent complexity,
often requiring substantial additional training and facing limitations due to
the scarcity of paired multimodal data. To overcome these challenges, here we
introduce FetalCLIP, a vision-language foundation model capable of generating
universal representations of fetal ultrasound images. FetalCLIP was pre-trained
using a multimodal learning approach on a diverse dataset of 210,035 fetal
ultrasound images paired with text. This represents the largest paired dataset
of its kind used for foundation model development to date. This unique training
approach allows FetalCLIP to effectively learn the intricate anatomical
features present in fetal ultrasound images, resulting in robust
representations that can be used for a variety of downstream applications. In
extensive benchmarking across a range of key fetal ultrasound applications,
including classification, gestational age estimation, congenital heart defect
(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all
baselines while demonstrating remarkable generalizability and strong
performance even with limited labeled data. We plan to release the FetalCLIP
model publicly for the benefit of the broader scientific community.
☆ AVD2: Accident Video Diffusion for Accident Video Description ICRA 2025
Traffic accidents present complex challenges for autonomous driving, often
featuring unpredictable scenarios that hinder accurate system interpretation
and responses. Nonetheless, prevailing methodologies fall short in elucidating
the causes of accidents and proposing preventive measures due to the paucity of
training data specific to accident scenarios. In this work, we introduce AVD2
(Accident Video Diffusion for Accident Video Description), a novel framework
that enhances accident scene understanding by generating accident videos
aligned with detailed natural language descriptions and reasoning, resulting in
the contributed EMM-AU (Enhanced Multi-Modal Accident Video Understanding)
dataset. Empirical results reveal that the integration of the EMM-AU dataset
establishes state-of-the-art performance across both automated metrics and
human evaluations, markedly advancing the domains of accident analysis and
prevention. Project resources are available at https://an-answer-tree.github.io
comment: ICRA 2025, Project Page: https://an-answer-tree.github.io/
☆ A Survey on Text-Driven 360-Degree Panorama Generation
The advent of text-driven 360-degree panorama generation, enabling the
synthesis of 360-degree panoramic images directly from textual descriptions,
marks a transformative advancement in immersive visual content creation. This
innovation significantly simplifies the traditionally complex process of
producing such content. Recent progress in text-to-image diffusion models has
accelerated development in this emerging field. This survey presents
a comprehensive review of text-driven 360-degree panorama generation, offering
an in-depth analysis of state-of-the-art algorithms and their expanding
applications in 360-degree 3D scene generation. Furthermore, we critically
examine current limitations and propose promising directions for future
research. A curated project page with relevant resources and research papers is
available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/.
☆ Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang
This paper addresses the limitations of current humanoid robot control
frameworks, which primarily rely on reactive mechanisms and lack autonomous
interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel
framework that integrates language understanding, egocentric scene perception,
and motion control, enabling universal humanoid control. Humanoid-VLA begins
with language-motion pre-alignment using non-egocentric human motion datasets
paired with textual descriptions, allowing the model to learn universal motion
patterns and action semantics. We then incorporate egocentric visual context
through parameter-efficient video-conditioned fine-tuning, enabling
context-aware motion generation. Furthermore, we introduce a self-supervised
data augmentation strategy that automatically generates pseudo-annotations
directly derived from motion data. This process converts raw motion sequences
into informative question-answer pairs, facilitating the effective use of
large-scale unlabeled video data. Built upon whole-body control architectures,
Humanoid-VLA achieves object interaction and environment exploration tasks with
enhanced contextual awareness in extensive experiments, demonstrating a more
human-like capacity for adaptive and intelligent engagement.
☆ RendBEV: Semantic Novel View Synthesis for Self-Supervised Bird's Eye View Segmentation WACV 2025
Bird's Eye View (BEV) semantic maps have recently garnered a lot of attention
as a useful representation of the environment to tackle assisted and autonomous
driving tasks. However, most of the existing work focuses on the fully
supervised setting, training networks on large annotated datasets. In this
work, we present RendBEV, a new method for the self-supervised training of BEV
semantic segmentation networks, leveraging differentiable volumetric rendering
to receive supervision from semantic perspective views computed by a 2D
semantic segmentation model. Our method enables zero-shot BEV semantic
segmentation, and already delivers competitive results in this challenging
setting. When used as pretraining to then fine-tune on labeled BEV
ground-truth, our method significantly boosts performance in low-annotation
regimes, and sets a new state of the art when fine-tuning on all available
labels.
comment: Accepted at WACV 2025
☆ Structurally Disentangled Feature Fields Distillation for 3D Understanding and Editing
Recent work has demonstrated the ability to leverage or distill pre-trained
2D features obtained using large pre-trained 2D models into 3D features,
enabling impressive 3D editing and understanding capabilities using only 2D
supervision. Although impressive, these models assume that 3D features are
captured by a single feature field and often make the simplifying assumption
that features are view-independent. In this work, we propose instead to capture 3D
features using multiple disentangled feature fields that capture different
structural components of 3D features involving view-dependent and
view-independent components, which can be learned from 2D feature supervision
only. Subsequently, each element can be controlled in isolation, enabling
semantic and structural understanding and editing capabilities. For instance,
using a user click, one can segment 3D features corresponding to a given object
and then segment, edit, or remove their view-dependent (reflective) properties.
We evaluate our approach on the task of 3D segmentation and demonstrate a set
of novel understanding and editing tasks.
☆ SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
We introduce SigLIP 2, a family of new multilingual vision-language encoders
that build on the success of the original SigLIP. In this second iteration, we
extend the original image-text training objective with several prior,
independently developed techniques into a unified recipe -- this includes
captioning-based pretraining, self-supervised losses (self-distillation, masked
prediction) and online data curation. With these changes, SigLIP 2 models
outperform their SigLIP counterparts at all model scales in core capabilities,
including zero-shot classification, image-text retrieval, and transfer
performance when extracting visual representations for Vision-Language Models
(VLMs). Furthermore, the new training recipe leads to significant improvements
on localization and dense prediction tasks. We also train variants which
support multiple resolutions and preserve the input's native aspect ratio.
Finally, we train on a more diverse data-mixture that includes de-biasing
techniques, leading to much better multilingual understanding and improved
fairness. To allow users to trade off inference cost with performance, we
release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M),
and g (1B).
comment: Model checkpoints are available at
https://github.com/google-research/big_vision/tree/main/big_vision/configs/proj/image_text/README_siglip2.md
☆ ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Efficient and privacy-preserving multimodal interaction is essential as AR,
VR, and modern smartphones with powerful cameras become primary interfaces for
human-computer communication. Existing powerful large vision-language models
(VLMs) enabling multimodal interaction often rely on cloud-based processing,
raising significant concerns about (1) visual privacy by transmitting sensitive
vision data to servers, and (2) their limited real-time, on-device usability.
This paper explores Visual Instruction Rewriting, a novel approach that
transforms multimodal instructions into text-only commands, allowing seamless
integration of lightweight on-device instruction rewriter VLMs (250M
parameters) with existing conversational AI systems, enhancing vision data
privacy. To achieve this, we present a dataset of over 39,000 examples across
14 domains and develop a compact VLM, pretrained on image captioning datasets
and fine-tuned for instruction rewriting. Experimental results, evaluated
through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic
parsing analysis, demonstrate that even a quantized version of the model
(<500MB storage footprint) can achieve effective instruction rewriting, thus
enabling privacy-focused, multimodal AI applications.
comment: 12 pages, 7 figures, 3 tables
☆ DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and
precisely controllable framework for multi-condition image generation. The core
idea behind DC-ControlNet is to decouple control conditions, transforming
global control into a hierarchical system that integrates distinct elements,
contents, and layouts. This enables users to mix these individual conditions
with greater flexibility, leading to more efficient and accurate image
generation control. Previous ControlNet-based models rely solely on global
conditions, which affect the entire image and lack the ability of element- or
region-specific control. This limitation reduces flexibility and can cause
condition misunderstandings in multi-conditional image generation. To address
these challenges, we propose both Intra-Element and Inter-Element Controllers
in DC-ControlNet. The Intra-Element Controller handles different types of
control signals within individual elements, accurately describing the content
and layout characteristics of the object. For interactions between elements, we
introduce the Inter-Element Controller, which accurately handles multi-element
interactions and occlusion based on user-defined relationships. Extensive
evaluations show that DC-ControlNet significantly outperforms existing
ControlNet models and Layout-to-Image generative models in terms of control
flexibility and precision in multi-condition control.
☆ Harnessing PDF Data for Improving Japanese Large Multimodal Models
Large Multimodal Models (LMMs) have demonstrated strong performance in
English, but their effectiveness in Japanese remains limited due to the lack of
high-quality training data. Current Japanese LMMs often rely on translated
English datasets, restricting their ability to capture Japan-specific cultural
knowledge. To address this, we explore the potential of Japanese PDF data as a
training resource, an area that remains largely underutilized. We introduce a
fully automated pipeline that leverages pretrained models to extract image-text
pairs from PDFs through layout analysis, OCR, and vision-language pairing,
removing the need for manual annotation. Additionally, we construct instruction
data from extracted image-text pairs to enrich the training data. To evaluate
the effectiveness of PDF-derived data, we train Japanese LMMs and assess their
performance on the Japanese LMM Benchmark. Our results demonstrate substantial
improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench.
Further analysis highlights the impact of PDF-derived data on various factors,
such as model size and language models, reinforcing its value as a multimodal
resource for Japanese LMMs. We plan to make the source code and data publicly
available upon acceptance.
comment: 15 pages, 8 figures
☆ Sculpting [CLS] Features for Pre-Trained Model-Based Class-Incremental Learning
Class-incremental learning requires models to continually acquire knowledge
of new classes without forgetting old ones. Although pre-trained models have
demonstrated strong performance in class-incremental learning, they remain
susceptible to catastrophic forgetting when learning new concepts. Excessive
plasticity in the models breaks generalizability and causes forgetting, while
strong stability results in insufficient adaptation to new classes. This
necessitates effective adaptation with minimal modifications to preserve the
general knowledge of pre-trained models. To address this challenge, we first
introduce a new parameter-efficient fine-tuning module 'Learn and Calibrate',
or LuCA, designed to acquire knowledge through an adapter-calibrator couple,
enabling effective adaptation with well-refined feature representations.
Second, for each learning session, we deploy a sparse LuCA module on top of the
last token just before the classifier, which we refer to as 'Token-level Sparse
Calibration and Adaptation', or TOSCA. This strategic design improves the
orthogonality between the modules and significantly reduces both training and
inference complexity. By leaving the generalization capabilities of the
pre-trained models intact and adapting exclusively via the last token, our
approach achieves a harmonious balance between stability and plasticity.
Extensive experiments demonstrate TOSCA's state-of-the-art performance while
introducing ~8 times fewer parameters compared to prior methods.
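A minimal sketch of an adapter-calibrator couple applied only to the last token before the classifier, assuming a low-rank bottleneck adapter and a per-dimension scale-and-shift calibrator; the real LuCA/TOSCA design may differ in both respects.

    # Toy sketch of a "Learn and Calibrate"-style module on the last token.
    import torch
    import torch.nn as nn

    class LuCAStyleModule(nn.Module):
        def __init__(self, dim, bottleneck=16):
            super().__init__()
            self.adapter = nn.Sequential(              # acquires task knowledge
                nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            self.scale = nn.Parameter(torch.ones(dim))   # calibrator: refine features
            self.shift = nn.Parameter(torch.zeros(dim))

        def forward(self, last_token):
            refined = last_token + self.adapter(last_token)   # residual adaptation
            return refined * self.scale + self.shift          # calibration

    # Usage: the pre-trained backbone stays frozen; only this module (and the
    # classifier head) is trained in each incremental session.
    module = LuCAStyleModule(dim=768)
    cls_token = torch.randn(8, 768)            # frozen backbone output
    calibrated = module(cls_token)             # fed to the classifier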
☆ MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders
Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari
Medical images are acquired at high resolutions with large fields of view in
order to capture fine-grained features necessary for clinical decision-making.
Consequently, training deep learning models on medical images can incur large
computational costs. In this work, we address the challenge of downsizing
medical images in order to improve downstream computational efficiency while
preserving clinically-relevant features. We introduce MedVAE, a family of six
large-scale 2D and 3D autoencoders capable of encoding medical images as
downsized latent representations and decoding latent representations back to
high-resolution images. We train MedVAE autoencoders using a novel two-stage
training approach with 1,052,730 medical images. Across diverse tasks obtained
from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent
representations in place of high-resolution images when training downstream
models can lead to efficiency benefits (up to 70x improvement in throughput)
while simultaneously preserving clinically-relevant features and (2) MedVAE can
decode latent representations back to high-resolution images with high
fidelity. Our work demonstrates that large-scale, generalizable autoencoders
can help address critical efficiency challenges in the medical domain. Our code
is available at https://github.com/StanfordMIMI/MedVAE.
☆ YOLOv12: A Breakdown of the Key Architectural Features
This paper presents an architectural analysis of YOLOv12, a significant
advancement in single-stage, real-time object detection building upon the
strengths of its predecessors while introducing key improvements. The model
incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and
FlashAttention-driven area-based attention, improving feature extraction,
efficiency, and detection robustness. With multiple model variants,
similar to its predecessors, YOLOv12 offers scalable solutions for both
latency-sensitive and high-accuracy applications. Experimental results show
consistent gains in mean average precision (mAP) and inference speed, making
YOLOv12 a compelling choice for applications in autonomous systems, security,
and real-time analytics. By achieving an optimal balance between computational
efficiency and performance, YOLOv12 sets a new benchmark for real-time computer
vision, facilitating deployment across diverse hardware platforms, from edge
devices to high-performance clusters.
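The 7x7 separable convolution mentioned above follows the standard depthwise-separable construction; a generic PyTorch block is sketched below, with channel counts, normalization, and activation chosen for illustration rather than taken from YOLOv12.

    # Generic 7x7 depthwise-separable convolution block (illustrative only):
    # a per-channel 7x7 conv followed by a pointwise 1x1 conv.
    import torch.nn as nn

    class SeparableConv7x7(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=7, padding=3,
                                       groups=in_ch, bias=False)  # spatial mixing
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.norm = nn.BatchNorm2d(out_ch)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.act(self.norm(self.pointwise(self.depthwise(x))))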
☆ Multi-dataset synergistic in supervised learning to pre-label structural components in point clouds from shell construction scenes
The significant effort required to annotate data for new training datasets
hinders computer vision research and machine learning in the construction
industry. This work explores adapting standard datasets and the latest
transformer model architectures for point cloud semantic segmentation in the
context of shell construction sites. Unlike common approaches focused on object
segmentation of building interiors and furniture, this study addresses the
challenges of segmenting complex structural components in Architecture,
Engineering, and Construction (AEC). We establish a baseline through supervised
training and a custom validation dataset, evaluate the cross-domain inference
with large-scale indoor datasets, and utilize transfer learning to maximize
segmentation performance with minimal new data. The findings indicate that with
minimal fine-tuning, pre-trained transformer architectures offer an effective
strategy for building component segmentation. Our results are promising for
automating the annotation of new, previously unseen data when creating larger
training resources and for the segmentation of frequently recurring objects.
comment: 18 pages, 8 figures, 7 tables
☆ CDGS: Confidence-Aware Depth Regularization for 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has shown significant advantages in novel view
synthesis (NVS), particularly in achieving high rendering speeds and
high-quality results. However, its geometric accuracy in 3D reconstruction
remains limited due to the lack of explicit geometric constraints during
optimization. This paper introduces CDGS, a confidence-aware depth
regularization approach developed to enhance 3DGS. We leverage multi-cue
confidence maps of monocular depth estimation and sparse Structure-from-Motion
depth to adaptively adjust depth supervision during the optimization process.
Our method demonstrates improved geometric detail preservation in early
training stages and achieves competitive performance in both NVS quality and
geometric accuracy. Experiments on the publicly available Tanks and Temples
benchmark dataset show that our method achieves more stable convergence
behavior and more accurate geometric reconstruction results, with improvements
of up to 2.31 dB in PSNR for NVS and consistently lower geometric errors in
M3C2 distance metrics. Notably, our method reaches comparable F-scores to the
original 3DGS with only 50% of the training iterations. We expect this work
will facilitate the development of efficient and accurate 3D reconstruction
systems for real-world applications such as digital twin creation, heritage
preservation, or forestry applications.
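One plausible form of a confidence-weighted depth term is sketched below; how CDGS builds the multi-cue confidence maps and schedules the weighting during optimization is more involved, so the snippet is an assumption-laden illustration only.

    # Sketch of a confidence-weighted depth regularizer for 3DGS training.
    # Assumption: `confidence` fuses monocular-depth and SfM cues into [0, 1].
    import torch

    def confidence_depth_loss(rendered_depth, prior_depth, confidence):
        """Penalize depth error only where the depth prior is trusted."""
        valid = prior_depth > 0                 # ignore pixels without a prior
        err = torch.abs(rendered_depth - prior_depth)
        weighted = confidence * err
        return weighted[valid].mean()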
☆ BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction
Trajectory prediction allows better decision-making in applications of
autonomous vehicles or surveillance by predicting the short-term future
movement of traffic agents. It is classified into pedestrian or heterogeneous
trajectory prediction. The former exploits the relatively consistent behavior
of pedestrians, but is limited in real-world scenarios with heterogeneous
traffic agents such as cyclists and vehicles. The latter typically relies on
extra class label information to distinguish the heterogeneous agents, but such
labels are costly to annotate and cannot be generalized to represent different
behaviors within the same class of agents. In this work, we introduce the
behavioral pseudo-labels that effectively capture the behavior distributions of
pedestrians and heterogeneous agents solely based on their motion features,
significantly improving the accuracy of trajectory prediction. To implement the
framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph
Convolution Network (BP-SGCN) that learns pseudo-labels and informs a
trajectory predictor. For optimization, we propose a cascaded training scheme,
in which we first learn the pseudo-labels in an unsupervised manner, and then
perform end-to-end fine-tuning on the labels in the direction of increasing the
trajectory prediction accuracy. Experiments show that our pseudo-labels
effectively model different behavior clusters and improve trajectory
prediction. Our proposed BP-SGCN outperforms existing methods using both
pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets
(SDD, Argoverse 1).
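The behavioral pseudo-labels are learned end to end inside BP-SGCN, but the underlying idea of grouping agents by motion statistics without class labels can be illustrated with a simple unsupervised sketch; the hand-crafted features and k-means clustering below are stand-ins for the learned assignment.

    # Illustrative sketch of behavior pseudo-labels from motion features alone.
    import numpy as np
    from sklearn.cluster import KMeans

    def motion_features(trajectory):
        """trajectory: (T, 2) array of x-y positions at a fixed frame rate, T >= 3."""
        vel = np.diff(trajectory, axis=0)
        speed = np.linalg.norm(vel, axis=1)
        heading = np.arctan2(vel[:, 1], vel[:, 0])
        heading_change = np.abs(np.diff(heading))
        return np.array([speed.mean(), speed.std(), heading_change.mean()])

    def pseudo_labels(trajectories, n_behaviors=4):
        feats = np.stack([motion_features(t) for t in trajectories])
        return KMeans(n_clusters=n_behaviors, n_init=10).fit_predict(feats)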
☆ MAGO-SP: Detection and Correction of Water-Fat Swaps in Magnitude-Only VIBE MRI
Robert Graf, Hendrik Möller, Sophie Starck, Matan Atad, Philipp Braun, Jonathan Stelter, Annette Peters, Lilian Krist, Stefan N. Willich, Henry Völzke, Robin Bülow, Klaus Berger, Tobias Pischon, Thoralf Niendorf, Johannes Paetzold, Dimitrios Karampinos, Daniel Rueckert, Jan Kirschke
Volume Interpolated Breath-Hold Examination (VIBE) MRI generates images
suitable for water and fat signal composition estimation. While the two-point
VIBE provides water-fat-separated images, the six-point VIBE allows estimation
of the effective transversal relaxation rate R2* and the proton density fat
fraction (PDFF), which are imaging markers for health and disease. Ambiguity
during signal reconstruction can lead to water-fat swaps. This shortcoming
challenges the application of VIBE-MRI for automated PDFF analyses of
large-scale clinical data and of population studies. This study develops an
automated pipeline to detect and correct water-fat swaps in
non-contrast-enhanced VIBE images. Our three-step pipeline begins with training
a segmentation network to classify volumes as "fat-like" or "water-like," using
synthetic water-fat swaps generated by merging fat and water volumes with
Perlin noise. Next, a denoising diffusion image-to-image network predicts water
volumes as signal priors for correction. Finally, we integrate this prior into
a physics-constrained model to recover accurate water and fat signals. Our
approach achieves a < 1% error rate in water-fat swap detection for a 6-point
VIBE. Notably, swaps disproportionately affect individuals in the Underweight
and Class 3 Obesity BMI categories. Our correction algorithm ensures accurate
solution selection in chemical phase MRIs, enabling reliable PDFF estimation.
This forms a solid technical foundation for automated large-scale population
imaging analysis.
☆ NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Image geo-localization is the task of predicting the specific location of an
image and requires complex reasoning across visual, geographical, and cultural
contexts. While prior Vision Language Models (VLMs) have the best accuracy at
this task, there is a dearth of high-quality datasets and models for analytical
reasoning. We first create NaviClues, a high-quality dataset derived from
GeoGuessr, a popular geography game, to supply examples of expert reasoning
from language. Using this dataset, we present Navig, a comprehensive image
geo-localization framework integrating global and fine-grained image
information. By reasoning with language, Navig reduces the average distance
error by 14% compared to previous state-of-the-art models while requiring fewer
than 1000 training samples. Our dataset and code are available at
https://github.com/SparrowZheyuan18/Navig/.
☆ Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion ICRA
Transparent object perception is indispensable for numerous robotic tasks.
However, accurately segmenting and estimating the depth of transparent objects
remain challenging due to complex optical properties. Existing methods
primarily address only one of these tasks, using extra inputs or specialized sensors,
neglecting the valuable interactions among tasks and the subsequent refinement
process, leading to suboptimal and blurry predictions. To address these issues,
we propose a monocular framework, which is the first to excel in both
segmentation and depth estimation of transparent objects, with only a
single-image input. Specifically, we devise a novel semantic and geometric
fusion module, effectively integrating the multi-scale information between
tasks. In addition, drawing inspiration from human perception of objects, we
further incorporate an iterative strategy, which progressively refines initial
features for clearer results. Experiments on two challenging synthetic and
real-world datasets demonstrate that our model surpasses state-of-the-art
monocular, stereo, and multi-view methods by a large margin of about
38.8%-46.2% with only a single RGB input. Codes and models are publicly
available at https://github.com/L-J-Yuan/MODEST.
comment: Accepted by ICRA 2025. The code is accessible through:
https://github.com/L-J-Yuan/MODEST
☆ Vision Foundation Models in Medical Image Analysis: Advances and Challenges
The rapid development of Vision Foundation Models (VFMs), particularly Vision
Transformers (ViT) and Segment Anything Model (SAM), has sparked significant
advances in the field of medical image analysis. These models have demonstrated
exceptional capabilities in capturing long-range dependencies and achieving
high generalization in segmentation tasks. However, adapting these large models
to medical image analysis presents several challenges, including domain
differences between medical and natural images, the need for efficient model
adaptation strategies, and the limitations of small-scale medical datasets.
This paper reviews the state-of-the-art research on the adaptation of VFMs to
medical image segmentation, focusing on the challenges of domain adaptation,
model compression, and federated learning. We discuss the latest developments
in adapter-based improvements, knowledge distillation techniques, and
multi-scale contextual feature modeling, and propose future directions to
overcome these bottlenecks. Our analysis highlights the potential of VFMs,
along with emerging methodologies such as federated learning and model
compression, to revolutionize medical image analysis and enhance clinical
applications. The goal of this work is to provide a comprehensive overview of
current approaches and suggest key areas for future research that can drive the
next wave of innovation in medical image segmentation.
comment: 17 pages, 1 figure
☆ Self-supervised Monocular Depth Estimation Robust to Reflective Surface Leveraged by Triplet Mining ICLR 2025
Self-supervised monocular depth estimation (SSMDE) aims to predict the dense
depth map of a monocular image, by learning depth from RGB image sequences,
eliminating the need for ground-truth depth labels. Although this approach
simplifies data acquisition compared to supervised methods, it struggles with
reflective surfaces, as they violate the assumptions of Lambertian reflectance,
leading to inaccurate training on such surfaces. To tackle this problem, we
propose a novel training strategy for an SSMDE by leveraging triplet mining to
pinpoint reflective regions at the pixel level, guided by the camera geometry
between different viewpoints. The proposed reflection-aware triplet mining loss
specifically penalizes the inappropriate photometric error minimization on the
localized reflective regions while preserving depth accuracy in non-reflective
areas. We also incorporate a reflection-aware knowledge distillation method
that enables a student model to selectively learn the pixel-level knowledge
from reflective and non-reflective regions. This results in robust depth
estimation across areas. Evaluation results on multiple datasets demonstrate
that our method effectively enhances depth quality on reflective surfaces and
outperforms state-of-the-art SSMDE baselines.
comment: Accepted at ICLR 2025
☆ Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance
3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and
semantics for autonomous driving perception, which is crucial for enabling
accurate and reliable decision-making. However, existing SSC methods are
limited to capturing sparse information from the current frame or naively
stacking multi-frame temporal features, thereby failing to acquire effective
scene context. These approaches ignore critical motion dynamics and struggle to
achieve temporal consistency. To address the above challenges, we propose a
novel temporal SSC method FlowScene: Learning Temporal 3D Semantic Scene
Completion via Optical Flow Guidance. By leveraging optical flow, FlowScene can
integrate motion, different viewpoints, occlusions, and other contextual cues,
thereby significantly improving the accuracy of 3D scene completion.
Specifically, our framework introduces two key components: (1) a Flow-Guided
Temporal Aggregation module that aligns and aggregates temporal features using
optical flow, capturing motion-aware context and deformable structures; and (2)
an Occlusion-Guided Voxel Refinement module that injects occlusion masks and
temporally aggregated features into 3D voxel space, adaptively refining voxel
representations for explicit geometric modeling. Experimental results
demonstrate that FlowScene achieves state-of-the-art performance on the
SemanticKITTI and SSCBench-KITTI-360 benchmarks.
☆ A Mobile Robotic Approach to Autonomous Surface Scanning in Legal Medicine
Sarah Grube, Sarah Latus, Martin Fischer, Vidas Raudonis, Axel Heinemann, Benjamin Ondruschka, Alexander Schlaefer
Purpose: Comprehensive legal medicine documentation includes both an internal
but also an external examination of the corpse. Typically, this documentation
is conducted manually during conventional autopsy. A systematic digital
documentation would be desirable, especially for the external examination of
wounds, which is becoming more relevant for legal medicine analysis. For this
purpose, RGB surface scanning has been introduced. While a manual full surface
scan using a handheld camera is time-consuming and operator-dependent, floor- or
ceiling-mounted robotic systems require substantial space and a dedicated room.
Hence, we consider whether a mobile robotic system can be used for external
documentation. Methods: We develop a mobile robotic system that enables
full-body RGB-D surface scanning. Our work includes a detailed configuration
space analysis to identify the environmental parameters that need to be
considered to successfully perform a surface scan. We validate our findings
through an experimental study in the lab and demonstrate the system's
application in a legal medicine environment. Results: Our configuration space
analysis shows that a good trade-off between coverage and time is reached with
three robot base positions, leading to a coverage of 94.96 %. Experiments
validate the effectiveness of the system in accurately capturing body surface
geometry with an average surface coverage of 96.90 ± 3.16% and 92.45 ± 1.43%
for a body phantom and actual corpses, respectively. Conclusion: This work
demonstrates the potential of a mobile robotic system to automate RGB-D surface
scanning in legal medicine, complementing the use of post-mortem CT scans for
inner documentation. Our results indicate that the proposed system can
contribute to more efficient and autonomous legal medicine documentation,
reducing the need for manual intervention.
comment: Submitted and accepted for presentation at CARS 2025. This preprint
has not undergone peer review or post-submission revisions. The final version
of this work will appear in the official CARS 2025 proceedings
☆ PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models
Large Vision-Language Models (LVLMs) have demonstrated remarkable
capabilities across a range of multimodal tasks. However, their inference
efficiency is constrained by the large number of visual tokens processed during
decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token
Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level
Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the
Vision Token Re-attention phenomenon across decoder layers, we dynamically
adjust token retention rates layer by layer. Layers that exhibit stronger
attention to visual information preserve more vision tokens, while layers with
lower vision attention are aggressively pruned. Furthermore, PLPHP applies
pruning at the attention head level, enabling different heads within the same
layer to independently retain critical context. Experiments on multiple
benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and
reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of
0.46% average performance drop, while also achieving notable performance
improvements in multi-image tasks. These results highlight the effectiveness of
fine-grained token pruning and contribute to advancing the efficiency and
scalability of LVLMs. Our source code will be made publicly available.
comment: 12 pages, 8 figures
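A toy sketch of head-level vision-token pruning: for each attention head, keep the vision tokens receiving the largest attention mass from text queries. The fixed retention rate and scoring rule are illustrative assumptions; PLPHP allocates retention rates per layer from the observed re-attention behavior.

    # Toy sketch of per-head vision-token pruning driven by attention scores.
    # Assumption: `attn` is (heads, text_queries, vision_tokens) for one layer.
    import torch

    def prune_vision_tokens_per_head(attn, retention_rate=0.4):
        num_heads, _, num_vis = attn.shape
        keep = max(1, int(num_vis * retention_rate))
        # Attention mass each head puts on each vision token, summed over queries.
        importance = attn.sum(dim=1)                     # (heads, vision_tokens)
        topk = importance.topk(keep, dim=-1).indices     # kept indices per head
        mask = torch.zeros(num_heads, num_vis, dtype=torch.bool)
        rows = torch.arange(num_heads).unsqueeze(1)
        mask[rows, topk] = True
        return mask   # True = this head keeps the token's KV cache entries

    attn = torch.rand(8, 16, 256)      # 8 heads, 16 text queries, 256 vision tokens
    kv_keep_mask = prune_vision_tokens_per_head(attn, retention_rate=0.4)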
☆ LXLv2: Enhanced LiDAR Excluded Lean 3D Object Detection with Fusion of 4D Radar and Camera
As the previous state-of-the-art 4D radar-camera fusion-based 3D object
detection method, LXL utilizes the predicted image depth distribution maps and
radar 3D occupancy grids to assist the sampling-based image view
transformation. However, the depth prediction lacks accuracy and consistency,
and the concatenation-based fusion in LXL impedes the model robustness. In this
work, we propose LXLv2, where modifications are made to overcome the
limitations and improve the performance. Specifically, considering the position
error in radar measurements, we devise a one-to-many depth supervision strategy
via radar points, where the radar cross section (RCS) value is further
exploited to adjust the supervision area for object-level depth consistency.
Additionally, a channel and spatial attention-based fusion module named
CSAFusion is introduced to improve feature adaptiveness. Experimental results
on the View-of-Delft and TJ4DRadSet datasets show that the proposed LXLv2 can
outperform LXL in detection accuracy, inference speed and robustness,
demonstrating the effectiveness of the model.
comment: Accepted by IEEE Robotics and Automation Letters
☆ Nearshore Underwater Target Detection Meets UAV-borne Hyperspectral Remote Sensing: A Novel Hybrid-level Contrastive Learning Framework and Benchmark Dataset
UAV-borne hyperspectral remote sensing has emerged as a promising approach
for underwater target detection (UTD). However, its effectiveness is hindered
by spectral distortions in nearshore environments, which compromise the
accuracy of traditional hyperspectral UTD (HUTD) methods that rely on
bathymetric models. These distortions lead to significant uncertainty in target
and background spectra, challenging the detection process. To address this, we
propose the Hyperspectral Underwater Contrastive Learning Network (HUCLNet), a
novel framework that integrates contrastive learning with a self-paced learning
paradigm for robust HUTD in nearshore regions. HUCLNet extracts discriminative
features from distorted hyperspectral data through contrastive learning, while
the self-paced learning strategy selectively prioritizes the most informative
samples. Additionally, a reliability-guided clustering strategy enhances the
robustness of learned representations. To evaluate the method's effectiveness, we
construct a novel nearshore HUTD benchmark dataset, ATR2-HUTD, covering three
diverse scenarios with varying water types, turbidity levels, and target types.
Extensive experiments demonstrate that HUCLNet significantly outperforms
state-of-the-art methods. The dataset and code will be publicly available at:
https://github.com/qjh1996/HUTD
comment: 18 pages, 13 figures
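The contrastive objective is not spelled out in the abstract; as a reference point, a generic InfoNCE loss over two views of the same sample is sketched below. It is only an assumed stand-in for HUCLNet's hybrid-level formulation.

    # Generic InfoNCE contrastive loss over paired embeddings (reference sketch):
    # two views of the same sample are positives, all other samples in the batch
    # act as negatives.
    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature               # (B, B) similarity matrix
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)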
☆ CrossFuse: Learning Infrared and Visible Image Fusion by Cross-Sensor Top-K Vision Alignment and Beyond
Infrared and visible image fusion (IVIF) is increasingly applied in critical
fields such as video surveillance and autonomous driving systems. Significant
progress has been made in deep learning-based fusion methods. However, these
models frequently encounter out-of-distribution (OOD) scenes in real-world
applications, which severely impact their performance and reliability.
Therefore, addressing the challenge of OOD data is crucial for the safe
deployment of these models in open-world environments. Unlike existing
research, our focus is on the challenges posed by OOD data in real-world
applications and on enhancing the robustness and generalization of models. In
this paper, we propose an infrared-visible fusion framework based on Multi-View
Augmentation. For external data augmentation, Top-k Selective Vision Alignment
is employed to mitigate distribution shifts between datasets by performing
RGB-wise transformations on visible images. This strategy effectively
introduces augmented samples, enhancing the adaptability of the model to
complex real-world scenarios. Additionally, for internal data augmentation,
self-supervised learning is established using Weak-Aggressive Augmentation.
This enables the model to learn more robust and general feature representations
during the fusion process, thereby improving robustness and generalization.
Extensive experiments demonstrate that the proposed method exhibits superior
performance and robustness across various conditions and environments. Our
approach significantly enhances the reliability and stability of IVIF tasks in
practical applications.
comment: IEEE T-CSVT. We mainly discuss the out-of-distribution challenges in
infrared and visible image fusion
☆ Temporal Misalignment and Probabilistic Neurons
Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to
Artificial Neural Networks (ANNs) by mimicking biological neural principles,
establishing them as a promising approach to mitigate the increasing energy
demands of large-scale neural models. However, fully harnessing the
capabilities of SNNs remains challenging due to their discrete signal
processing and temporal dynamics. ANN-SNN conversion has emerged as a practical
approach, enabling SNNs to achieve competitive performance on complex machine
learning tasks. In this work, we identify a phenomenon in the ANN-SNN
conversion framework, termed temporal misalignment, in which random spike
rearrangement across SNN layers leads to performance improvements. Based on
this observation, we introduce biologically plausible two-phase probabilistic
(TPP) spiking neurons, further enhancing the conversion process. We demonstrate
the advantages of our proposed method both theoretically and empirically
through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet
across a variety of architectures, achieving state-of-the-art results.
☆ Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well
Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li
Camouflaged Object Segmentation (COS) remains a challenging problem due to
the subtle visual differences between camouflaged objects and backgrounds.
Owing to the exceedingly limited visual cues available from the visible spectrum,
previous RGB single-modality approaches often struggle to achieve satisfactory
results, prompting the exploration of multimodal data to enhance detection
accuracy. In this work, we present UniCOS, a novel framework that effectively
leverages diverse data modalities to improve segmentation performance. UniCOS
comprises two key components: a multimodal segmentor, UniSEG, and a cross-modal
knowledge learning module, UniLearner. UniSEG employs a state space fusion
mechanism to integrate cross-modal features within a unified state space,
enhancing contextual understanding and improving robustness to integration of
heterogeneous data. Additionally, it includes a fusion-feedback mechanism that
facilitates feature extraction. UniLearner exploits multimodal data unrelated to
the COS task to improve the segmentation ability of the COS models by
generating pseudo-modal content and cross-modal semantic associations.
Extensive experiments demonstrate that UniSEG outperforms existing Multimodal
COS (MCOS) segmentors, regardless of whether real or pseudo-multimodal COS data
is available. Moreover, in scenarios where multimodal COS data is unavailable
but multimodal non-COS data is accessible, UniLearner effectively exploits
these data to enhance segmentation performance. Our code will be made publicly
available on \href{https://github.com/cnyvfang/UniCOS}{GitHub}.
comment: 12 pages, 5 figures, 6 tables
☆ Single-image Reflectance and Transmittance Estimation from Any Flatbed Scanner
Carlos Rodriguez-Pardo, David Pascual-Hernandez, Javier Rodriguez-Vazquez, Jorge Lopez-Moreno, Elena Garces
Flatbed scanners have emerged as promising devices for high-resolution,
single-image material capture. However, existing approaches assume very
specific conditions, such as uniform diffuse illumination, which are only
available in certain high-end devices, hindering their scalability and cost. In
contrast, in this work, we introduce a method inspired by intrinsic image
decomposition, which accurately removes both shading and specularity,
effectively allowing captures with any flatbed scanner. Further, we extend
previous work on single-image material reflectance capture with the estimation
of opacity and transmittance, critical components of full material appearance
(SVBSDF), improving the results for any material captured with a flatbed
scanner, at very high resolution and accuracy.
comment: Accepted to Computers & Graphics
☆ Exploiting Deblurring Networks for Radiance Fields
In this paper, we propose DeepDeblurRF, a novel radiance field deblurring
approach that can synthesize high-quality novel views from blurred training
views with significantly reduced training time. DeepDeblurRF leverages deep
neural network (DNN)-based deblurring modules to enjoy their deblurring
performance and computational efficiency. To effectively combine DNN-based
deblurring and radiance field construction, we propose a novel radiance field
(RF)-guided deblurring and an iterative framework that performs RF-guided
deblurring and radiance field construction in an alternating manner. Moreover,
DeepDeblurRF is compatible with various scene representations, such as voxel
grids and 3D Gaussians, expanding its applicability. We also present
BlurRF-Synth, the first large-scale synthetic dataset for training radiance
field deblurring frameworks. We conduct extensive experiments on both camera
motion blur and defocus blur, demonstrating that DeepDeblurRF achieves
state-of-the-art novel-view synthesis quality with significantly reduced
training time.
☆ Stochastic Resonance Improves the Detection of Low Contrast Images in Deep Learning Models
Stochastic resonance describes the utility of noise in improving the
detectability of weak signals in certain types of systems. It has been observed
widely in natural and engineered settings, but its utility in image
classification with rate-based neural networks has not been studied
extensively. In this analysis a simple LSTM recurrent neural network is trained
for digit recognition and classification. During the test phase, image contrast
is reduced to a point where the model fails to recognize the presence of a
stimulus. Controlled noise is added to partially recover classification
performance. The results indicate the presence of stochastic resonance in
rate-based recurrent neural networks.
comment: MSc Course Project
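The test-time protocol described above is easy to reproduce in outline: reduce image contrast until the classifier fails, then sweep additive noise levels and look for a non-monotonic accuracy curve. The contrast and noise models below are assumptions, and model, images, and labels are placeholders.

    # Outline of the stochastic-resonance test protocol described above.
    import numpy as np

    def reduce_contrast(images, factor=0.05):
        mean = images.mean(axis=(1, 2), keepdims=True)
        return mean + factor * (images - mean)           # near sub-threshold stimuli

    def accuracy_vs_noise(model, images, labels, noise_levels):
        weak = reduce_contrast(images)
        accs = []
        for sigma in noise_levels:
            noisy = np.clip(weak + np.random.normal(0, sigma, weak.shape), 0, 1)
            preds = model.predict(noisy).argmax(axis=-1)
            accs.append((preds == labels).mean())
        return accs   # stochastic resonance appears as a peak at intermediate sigma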
☆ Daily Land Surface Temperature Reconstruction in Landsat Cross-Track Areas Using Deep Ensemble Learning With Uncertainty Quantification
Many real-world applications rely on land surface temperature (LST) data at
high spatiotemporal resolution. In complex urban areas, LST exhibits
significant variations, fluctuating dramatically within and across city blocks.
Landsat provides high spatial resolution data at 100 meters but is limited by
long revisit time, with cloud cover further disrupting data collection. Here,
we propose DELAG, a deep ensemble learning method that integrates annual
temperature cycles and Gaussian processes, to reconstruct Landsat LST in
complex urban areas. Leveraging the cross-track characteristics and
dual-satellite operation of Landsat since 2021, we further enhance data
availability to 4 scenes every 16 days. We select New York City, London and
Hong Kong from three different continents as study areas. Experiments show that
DELAG successfully reconstructed LST in the three cities under clear-sky (RMSE
= 0.73-0.96 K) and heavily-cloudy (RMSE = 0.84-1.62 K) situations, superior to
existing methods. Additionally, DELAG can quantify uncertainty that enhances
LST reconstruction reliability. We further tested the reconstructed LST to
estimate near-surface air temperature, achieving results (RMSE = 1.48-2.11 K)
comparable to those derived from clear-sky LST (RMSE = 1.63-2.02 K). The
results demonstrate the successful reconstruction through DELAG and highlight
the broader applications of LST reconstruction for estimating accurate air
temperature. Our study thus provides a novel and practical method for Landsat
LST reconstruction, particularly suited for complex urban areas within Landsat
cross-track areas, taking one step toward addressing complex climate events at
high spatiotemporal resolution.
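A rough, hypothetical sketch of the two ingredients named above, an annual temperature cycle (ATC) fit plus a Gaussian process on its residuals, using scikit-learn; the single-pixel data, harmonic basis, and kernel are placeholders, not the released DELAG code.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical clear-sky LST observations for one pixel (day of year -> Kelvin).
doy = np.sort(rng.choice(np.arange(1, 366), size=40, replace=False)).astype(float)
lst = 290 + 12 * np.sin(2 * np.pi * (doy - 120) / 365) + rng.normal(0, 1.0, doy.size)

# 1) Annual temperature cycle (ATC): mean plus one sinusoidal harmonic, fit by least squares.
X = np.column_stack([np.ones_like(doy),
                     np.sin(2 * np.pi * doy / 365), np.cos(2 * np.pi * doy / 365)])
coef, *_ = np.linalg.lstsq(X, lst, rcond=None)
atc = X @ coef

# 2) Gaussian process on the ATC residuals captures weather-driven deviations.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=20.0) + WhiteKernel(1.0),
                              normalize_y=True)
gp.fit(doy.reshape(-1, 1), lst - atc)

# Reconstruct LST (with uncertainty) for every day of the year.
grid = np.arange(1, 366, dtype=float)
Xg = np.column_stack([np.ones_like(grid),
                      np.sin(2 * np.pi * grid / 365), np.cos(2 * np.pi * grid / 365)])
resid_mean, resid_std = gp.predict(grid.reshape(-1, 1), return_std=True)
lst_rec = Xg @ coef + resid_mean    # reconstructed daily LST
print(lst_rec[:5], resid_std[:5])   # resid_std quantifies reconstruction uncertainty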
☆ ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Ran Cheng, Yaxin Peng, Chaomin Shen, Feifei Feng
Humans possess a unified cognitive ability to perceive, comprehend, and
interact with the physical world. Why can't large language models replicate
this holistic understanding? Through a systematic analysis of existing training
paradigms in vision-language-action models (VLA), we identify two key
challenges: spurious forgetting, where robot training overwrites crucial
visual-text alignments, and task interference, where competing control and
understanding tasks degrade performance when trained jointly. To overcome these
limitations, we propose ChatVLA, a novel framework featuring Phased Alignment
Training, which incrementally integrates multimodal data after initial control
mastery, and a Mixture-of-Experts architecture to minimize task interference.
ChatVLA demonstrates competitive performance on visual question-answering
datasets and significantly surpasses state-of-the-art vision-language-action
(VLA) methods on multimodal understanding benchmarks. Notably, it achieves a
sixfold higher performance on MMMU and scores 47.2% on MMStar with a more
parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates
superior performance on 25 real-world robot manipulation tasks compared to
existing VLA methods like OpenVLA. Our findings highlight the potential of our
unified framework for achieving both robust multimodal understanding and
effective robot control.
☆ Role of the Pretraining and the Adaptation data sizes for low-resource real-time MRI video segmentation ICASSP 2025
Real-time Magnetic Resonance Imaging (rtMRI) is frequently used in speech
production studies as it provides a complete view of the vocal tract during
articulation. This study investigates the effectiveness of rtMRI in analyzing
vocal tract movements by employing the SegNet and UNet models for Air-Tissue
Boundary (ATB) segmentation tasks. We pretrained several base models using
increasing numbers of subjects and videos and assessed their performance on two
datasets. The first consists of unseen subjects with unseen videos from the same
data source, where pretraining yields 0.33% and 0.91% higher Pixel-wise
Classification Accuracy (PCA) and Dice Coefficient, respectively, than the
matched condition. The second comprises unseen videos from a new data source,
where we obtain 99.63% and 98.09% (PCA and Dice Coefficient, respectively) of
the matched-condition performance. Here, matched-condition performance refers to
the performance of a model trained only on the test subjects, which serves as a
benchmark for the other models. Our findings highlight the significance of
fine-tuning and adapting models with limited data. Notably, we demonstrated
that effective model adaptation can be achieved with as few as 15 rtMRI frames
from any new dataset.
comment: Accepted to ICASSP 2025
☆ Evaluating Precise Geolocation Inference Capabilities of Vision Language Models AAAI 2025
The prevalence of Vision-Language Models (VLMs) raises important questions
about privacy in an era where visual information is increasingly available.
While foundation VLMs demonstrate broad knowledge and learned capabilities, we
specifically investigate their ability to infer geographic location from
previously unseen image data. This paper introduces a benchmark dataset
collected from Google Street View that represents its global distribution of
coverage. Foundation models are evaluated on single-image geolocation
inference, with many achieving median distance errors of <300 km. We further
evaluate VLM "agents" with access to supplemental tools, observing up to a
30.6% decrease in distance error. Our findings establish that modern foundation
VLMs can act as powerful image geolocation tools, without being specifically
trained for this task. When coupled with increasing accessibility of these
models, our findings have greater implications for online privacy. We discuss
these risks, as well as future work in this area.
comment: AAAI 2025 Workshop DATASAFE
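A small sketch of the headline evaluation metric, median great-circle distance error in kilometres, computed with the haversine formula; the coordinate pairs are made up for illustration.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between predicted and true coordinates, in kilometres.
    R = 6371.0
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# Hypothetical (prediction, ground truth) pairs from a VLM geolocation run.
preds = np.array([[48.85, 2.35], [40.71, -74.00], [35.68, 139.69]])
truth = np.array([[48.86, 2.29], [41.00, -73.50], [34.05, -118.24]])

errors = haversine_km(preds[:, 0], preds[:, 1], truth[:, 0], truth[:, 1])
print("median distance error (km):", np.median(errors))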
☆ MedFuncta: Modality-Agnostic Representations Based on Efficient Neural Fields
Recent research in medical image analysis with deep learning almost
exclusively focuses on grid- or voxel-based data representations. We challenge
this common choice by introducing MedFuncta, a modality-agnostic continuous
data representation based on neural fields. We demonstrate how to scale neural
fields from single instances to large datasets by exploiting redundancy in
medical signals and by applying an efficient meta-learning approach with a
context reduction scheme. We further address the spectral bias in commonly used
SIREN activations by introducing an $\omega_0$-schedule, improving
reconstruction quality and convergence speed. We validate our proposed approach
on a large variety of medical signals of different dimensions and modalities
(1D: ECG; 2D: Chest X-ray, Retinal OCT, Fundus Camera, Dermatoscope, Colon
Histopathology, Cell Microscopy; 3D: Brain MRI, Lung CT) and successfully
demonstrate that we can solve relevant downstream tasks on these
representations. We additionally release a large-scale dataset of > 550k
annotated neural fields to promote research in this direction.
comment: Code and Dataset: https://github.com/pfriedri/medfuncta
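A minimal PyTorch sketch of a SIREN network whose frequency multiplier $\omega_0$ follows a per-layer schedule; the linear decay used here is an assumption for illustration, not the schedule defined in the paper.

import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Standard SIREN layer: y = sin(omega_0 * (W x + b))."""
    def __init__(self, in_dim, out_dim, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

def make_siren(in_dim=2, hidden=256, out_dim=1, depth=4, omega_first=30.0, omega_last=10.0):
    # Assumed omega_0-schedule: decay the frequency multiplier layer by layer,
    # trading high-frequency detail in early layers against stable convergence later.
    omegas = torch.linspace(omega_first, omega_last, depth).tolist()
    layers = [SineLayer(in_dim, hidden, omegas[0], is_first=True)]
    for w in omegas[1:]:
        layers.append(SineLayer(hidden, hidden, w))
    layers.append(nn.Linear(hidden, out_dim))
    return nn.Sequential(*layers)

model = make_siren()
coords = torch.rand(1024, 2) * 2 - 1   # normalized pixel coordinates
pred = model(coords)                    # reconstructed signal values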
☆ PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
We introduce PhotoDoodle, a novel image editing framework designed to
facilitate photo doodling by enabling artists to overlay decorative elements
onto photographs. Photo doodling is challenging because the inserted elements
must appear seamlessly integrated with the background, requiring realistic
blending, perspective alignment, and contextual coherence. Additionally, the
background must be preserved without distortion, and the artist's unique style
must be captured efficiently from limited training data. These requirements are
not addressed by previous methods that primarily focus on global style transfer
or regional inpainting. The proposed method, PhotoDoodle, employs a two-stage
training strategy. Initially, we train a general-purpose image editing model,
OmniEditor, using large-scale data. Subsequently, we fine-tune this model with
EditLoRA using a small, artist-curated dataset of before-and-after image pairs
to capture distinct editing styles and techniques. To enhance consistency in
the generated results, we introduce a positional encoding reuse mechanism.
Additionally, we release a PhotoDoodle dataset featuring six high-quality
styles. Extensive experiments demonstrate the advanced performance and
robustness of our method in customized image editing, opening new possibilities
for artistic creation.
☆ RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Zhanjie Zhang, Xuanhua He, Shanyuan Liu, Bo Cheng, Dawei Leng, Yuhui Yin, Jie Zhang
The Diffusion Transformer plays a pivotal role in advancing text-to-image and
text-to-video generation, owing primarily to its inherent scalability. However,
existing controlled diffusion transformer methods incur significant parameter
and computational overheads and suffer from inefficient resource allocation due
to their failure to account for the varying relevance of control information
across different transformer layers. To address this, we propose the
Relevance-Guided Efficient Controllable Generation framework, RelaCtrl,
enabling efficient and resource-optimized integration of control signals into
the Diffusion Transformer. First, we evaluate the relevance of each layer in
the Diffusion Transformer to the control information by assessing the
"ControlNet Relevance Score"-i.e., the impact of skipping each control layer on
both the quality of generation and the control effectiveness during inference.
Based on the strength of the relevance, we then tailor the positioning,
parameter scale, and modeling capacity of the control layers to reduce
unnecessary parameters and redundant computations. Additionally, to further
improve efficiency, we replace the self-attention and FFN in the commonly used
copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM),
enabling efficient implementation of both the token mixer and channel mixer.
Both qualitative and quantitative experimental results demonstrate that our
approach achieves superior performance with only 15% of the parameters and
computational complexity compared to PixArt-delta. More examples are available
at https://relactrl.github.io/RelaCtrl/.
comment: 15 pages, 9 figures
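A toy, hypothetical sketch of the layer-skipping probe behind the "ControlNet Relevance Score": disable the control injection at one layer at a time and measure how much the output changes. The tiny linear model and the relative-norm metric are assumptions, not the paper's architecture or score definition.

import torch
import torch.nn as nn

torch.manual_seed(0)

depth, dim = 6, 32
backbone = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
control  = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])  # per-layer control branches

def forward(x, c, skip_layer=None):
    h = x
    for i, (blk, ctl) in enumerate(zip(backbone, control)):
        h = torch.relu(blk(h))
        if i != skip_layer:            # inject the control signal unless this layer is skipped
            h = h + ctl(c)
    return h

x = torch.randn(8, dim)   # toy latent
c = torch.randn(8, dim)   # toy control condition
ref = forward(x, c)

# Relevance of each layer: output deviation caused by skipping its control injection.
for i in range(depth):
    score = (forward(x, c, skip_layer=i) - ref).norm() / ref.norm()
    print(f"layer {i}: relative output change = {score.item():.4f}")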
☆ A Similarity Paradigm Through Textual Regularization Without Forgetting
Prompt learning has emerged as a promising method for adapting pre-trained
visual-language models (VLMs) to a range of downstream tasks. While optimizing
the context can be effective for improving performance on specific tasks, it
can often lead to poor generalization performance on unseen classes or datasets
sampled from different distributions. It may be attributed to the fact that
textual prompts tend to overfit downstream data distributions, leading to the
forgetting of generalized knowledge derived from hand-crafted prompts. In this
paper, we propose a novel method called Similarity Paradigm with Textual
Regularization (SPTR) for prompt learning without forgetting. SPTR is a
two-pronged design built on hand-crafted prompts, whose two components form an
inseparable framework. 1) To avoid forgetting general textual knowledge, we
introduce optimal transport as a textual regularization that keeps the tuned
textual features close to the hand-crafted features. 2) In order to
continuously unleash the general ability of multiple hand-crafted prompts, we
propose a similarity paradigm for natural alignment score and adversarial
alignment score to improve model robustness for generalization. Both modules
share a common objective in addressing generalization issues, aiming to
maximize the generalization capability derived from multiple hand-crafted
prompts. Four representative tasks (i.e., non-generalization few-shot learning,
base-to-novel generalization, cross-dataset generalization, domain
generalization) across 11 datasets demonstrate that SPTR outperforms existing
prompt learning methods.
☆ CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors
Donghao Luo, Yujie Liang, Xu Peng, Xiaobin Hu, Boyuan Jiang, Chengming Xu, Taisong Jin, Chengjie Wang, Yanwei Fu
Despite remarkable progress in image-based virtual try-on systems, generating
realistic and robust fitting images for cross-category virtual try-on remains a
challenging task. The primary difficulty arises from the absence of human-like
reasoning, which involves addressing size mismatches between garments and
models while recognizing and leveraging the distinct functionalities of various
regions within the model images. To address this issue, we draw inspiration
from human cognitive processes and disentangle the complex reasoning required
for cross-category try-on into a structured framework. This framework
systematically decomposes the model image into three distinct regions: try-on,
reconstruction, and imagination zones. Each zone plays a specific role in
accommodating the garment and facilitating realistic synthesis. To endow the
model with robust reasoning capabilities for cross-category scenarios, we
propose an iterative data constructor. This constructor encompasses diverse
scenarios, including intra-category try-on, any-to-dress transformations
(replacing any garment category with a dress), and dress-to-any transformations
(replacing a dress with another garment category). Utilizing the generated
dataset, we introduce a tri-zone priors generator that intelligently predicts
the try-on, reconstruction, and imagination zones by analyzing how the input
garment is expected to align with the model image. Guided by these tri-zone
priors, our proposed method, CrossVTON, achieves state-of-the-art performance,
surpassing existing baselines in both qualitative and quantitative evaluations.
Notably, it demonstrates superior capability in handling cross-category virtual
try-on, meeting the complex demands of real-world applications.
☆ PPO-MI: Efficient Black-Box Model Inversion via Proximal Policy Optimization ICML 2025
Model inversion attacks pose a significant privacy risk by attempting to
reconstruct private training data from trained models. Most of the existing
methods either depend on gradient estimation or require white-box access to
model parameters, which limits their applicability in practical scenarios. In
this paper, we propose PPO-MI, a novel reinforcement learning-based framework
for black-box model inversion attacks. Our approach formulates the inversion
task as a Markov Decision Process, where an agent navigates the latent space of
a generative model to reconstruct private training samples using only model
predictions. By employing Proximal Policy Optimization (PPO) with a
momentum-based state transition mechanism, along with a reward function
balancing prediction accuracy and exploration, PPO-MI ensures efficient latent
space exploration and high query efficiency. Extensive experiments illustrate
that PPO-MI outperforms existing methods while requiring less attack knowledge,
and that it is robust across various model architectures and
datasets. These results underline its effectiveness and generalizability in
practical black-box scenarios, raising important considerations for the privacy
vulnerabilities of deployed machine learning models.
comment: 6 pages, submitting to ICML 2025
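A hypothetical sketch of the MDP formulation sketched above: states are latent codes of a generator, actions perturb them through a momentum-based transition, and the reward trades target-class confidence against an exploration bonus. The toy generator and black-box classifier are placeholders, and the PPO policy update itself is omitted.

import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, NUM_CLASSES, TARGET = 16, 10, 3
W = rng.standard_normal((LATENT_DIM, NUM_CLASSES))   # frozen stand-in for the target model

def generator(z):                        # placeholder for a GAN generator G(z)
    return np.tanh(z)

def classifier_probs(x):                 # placeholder black-box classifier (predictions only)
    logits = x @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

class InversionEnv:
    """State = latent code z; action = perturbation; momentum-based state transition."""
    def __init__(self, beta=0.9, explore_coef=0.01):
        self.beta, self.explore_coef = beta, explore_coef

    def reset(self):
        self.z = rng.standard_normal(LATENT_DIM)
        self.v = np.zeros(LATENT_DIM)
        return self.z

    def step(self, action):
        self.v = self.beta * self.v + (1.0 - self.beta) * action   # momentum transition
        self.z = self.z + self.v
        probs = classifier_probs(generator(self.z))
        # Reward balances target-class confidence against an exploration bonus.
        reward = probs[TARGET] + self.explore_coef * np.linalg.norm(action)
        return self.z, reward

env = InversionEnv()
state = env.reset()
state, reward = env.step(0.1 * rng.standard_normal(LATENT_DIM))   # a PPO policy would supply the action
print("reward:", round(float(reward), 4))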
☆ Topology-Aware Wavelet Mamba for Airway Structure Segmentation in Postoperative Recurrent Nasopharyngeal Carcinoma CT Scans
Haishan Huang, Pengchen Liang, Naier Lin, Luxi Wang, Bin Pu, Jianguo Chen, Qing Chang, Xia Shen, Guo Ran
Nasopharyngeal carcinoma (NPC) patients often undergo radiotherapy and
chemotherapy, which can lead to postoperative complications such as limited
mouth opening and joint stiffness, particularly in recurrent cases that require
re-surgery. These complications can affect airway function, making accurate
postoperative airway risk assessment essential for managing patient care.
Accurate segmentation of airway-related structures in postoperative CT scans is
crucial for assessing these risks. This study introduces TopoWMamba
(Topology-aware Wavelet Mamba), a novel segmentation model specifically
designed to address the challenges of postoperative airway risk evaluation in
recurrent NPC patients. TopoWMamba combines wavelet-based multi-scale feature
extraction, state-space sequence modeling, and topology-aware modules to
segment airway-related structures in CT scans robustly. By leveraging the
Wavelet-based Mamba Block (WMB) for hierarchical frequency decomposition and
the Snake Conv VSS (SCVSS) module to preserve anatomical continuity, TopoWMamba
effectively captures both fine-grained boundaries and global structural
context, crucial for accurate segmentation in complex postoperative scenarios.
Through extensive testing on the NPCSegCT dataset, TopoWMamba achieves an
average Dice score of 88.02%, outperforming existing models such as UNet,
Attention UNet, and SwinUNet. Additionally, TopoWMamba is tested on the SegRap
2023 Challenge dataset, where it shows a significant improvement in trachea
segmentation with a Dice score of 95.26%. The proposed model provides a strong
foundation for automated segmentation, enabling more accurate postoperative
airway risk evaluation.
comment: 20 pages, 11 figures, 6 tables
☆ Weed Detection using Convolutional Neural Network
In this paper we use convolutional neural networks (CNNs) for weed detection
in agricultural land. We specifically investigate the application of two CNN
layer types, Conv2d and dilated Conv2d, for weed detection in crop fields. The
suggested method extracts features from the input photos using pre-trained
models, which are subsequently adjusted for weed detection. The experiments used
a sizable dataset of 15,336 segments (3,249 soil, 7,376 soybean, 3,520 grass,
and 1,191 broadleaf weed), and the findings show that the suggested approach can
accurately and successfully detect weeds with an accuracy of 94%. This study has
significant ramifications for
lowering the usage of toxic herbicides and increasing the effectiveness of weed
management in agriculture.
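A minimal PyTorch sketch contrasting the two layer types mentioned above, standard Conv2d and dilated Conv2d, in a small classifier for the four segment classes (soil, soybean, grass, broadleaf weed); the architecture is illustrative, not the paper's.

import torch
import torch.nn as nn

class WeedClassifier(nn.Module):
    def __init__(self, num_classes=4):   # soil, soybean, grass, broadleaf weed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),                # standard convolution
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=2, dilation=2),   # dilated convolution: wider receptive field
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = WeedClassifier()
logits = model(torch.randn(8, 3, 224, 224))   # batch of image segments
print(logits.shape)                            # torch.Size([8, 4])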
☆ Triply Laplacian Scale Mixture Modeling for Seismic Data Noise Suppression
Sirui Pan, Zhiyuan Zha, Shigang Wang, Yue Li, Zipei Fan, Gang Yan, Binh T. Nguyen, Bihan Wen, Ce Zhu
Sparsity-based tensor recovery methods have shown great potential in
suppressing seismic data noise. These methods exploit tensor sparsity measures
capturing the low-dimensional structures inherent in seismic data tensors to
remove noise by applying sparsity constraints through soft-thresholding or
hard-thresholding operators. However, in these methods, considering that real
seismic data are non-stationary and affected by noise, the variances of tensor
coefficients are unknown and may be difficult to accurately estimate from the
degraded seismic data, leading to undesirable noise suppression performance. In
this paper, we propose a novel triply Laplacian scale mixture (TLSM) approach
for seismic data noise suppression, which significantly improves the estimation
accuracy of both the sparse tensor coefficients and hidden scalar parameters.
To make the optimization problem manageable, an alternating direction method of
multipliers (ADMM) algorithm is employed to solve the proposed TLSM-based
seismic data noise suppression problem. Extensive experimental results on
synthetic and field seismic data demonstrate that the proposed TLSM algorithm
outperforms many state-of-the-art seismic data noise suppression methods in
both quantitative and qualitative evaluations while providing exceptional
computational efficiency.
☆ SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images
Positron Emission Tomography (PET) imaging plays a crucial role in modern
medical diagnostics by revealing the metabolic processes within a patient's
body, which is essential for quantification of therapy response and monitoring
treatment progress. However, the segmentation of PET images presents unique
challenges due to their lower contrast and less distinct boundaries compared to
other structural medical modalities. Recent developments in segmentation
foundation models have shown superior versatility across diverse natural image
segmentation tasks. Despite the efforts of medical adaptations, these works
primarily focus on structural medical images with detailed physiological
structural information and exhibit poor generalization ability when adapted to
molecular PET imaging. In this paper, we collect and construct PETS-5k, the
largest PET segmentation dataset to date, comprising 5,731 three-dimensional
whole-body PET images and encompassing over 1.3M 2D images. Based on the
established dataset, we develop SegAnyPET, a modality-specific 3D foundation
model for universal promptable segmentation from PET images. To address the
challenge of discrepant annotation quality in PET images, we adopt a cross
prompting confident learning (CPCL) strategy with an uncertainty-guided
self-rectification process to robustly learn segmentation from high-quality
labeled data and low-quality noisy labeled data. Experimental results
demonstrate that SegAnyPET can correctly segment seen and unseen targets using
only one or a few prompt points, outperforming state-of-the-art foundation
models and task-specific fully supervised models with higher accuracy and
strong generalization ability for universal segmentation. As the first
foundation model for PET images, we believe that SegAnyPET will advance the
applications to various downstream tasks for molecular imaging.
☆ Towards Accurate Binary Spiking Neural Networks: Learning with Adaptive Gradient Modulation Mechanism AAAI
Binary Spiking Neural Networks (BSNNs) inherit the event-driven paradigm of
SNNs, while also adopting the reduced storage burden of binarization
techniques. These distinct advantages grant BSNNs lightweight and
energy-efficient characteristics, rendering them ideal for deployment on
resource-constrained edge devices. However, due to the binary synaptic weights
and non-differentiable spike function, effectively training BSNNs remains an
open question. In this paper, we conduct an in-depth analysis of the challenge
for BSNN learning, namely the frequent weight sign flipping problem. To
mitigate this issue, we propose an Adaptive Gradient Modulation Mechanism
(AGMM), which is designed to reduce the frequency of weight sign flipping by
adaptively adjusting the gradients during the learning process. The proposed
AGMM can enable BSNNs to achieve faster convergence speed and higher accuracy,
effectively narrowing the gap between BSNNs and their full-precision
equivalents. We validate AGMM on both static and neuromorphic datasets, and
results indicate that it achieves state-of-the-art results among BSNNs. This
work substantially reduces storage demands and enhances SNNs' inherent energy
efficiency, making them highly feasible for resource-constrained environments.
comment: 9 pages, 8 figures, AAAI conference
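A heavily simplified sketch of the general idea of modulating gradients on binarized weights to damp frequent sign flips; the specific rule below is an assumption for illustration and is not the AGMM formula from the paper.

import torch

def modulate_gradients(latent_weights, grads, strength=1.0):
    """
    Illustrative modulation rule (not the paper's AGMM): damp gradients that
    would flip the sign of a latent weight that is already far from zero,
    reducing oscillatory weight sign flips during BSNN training.
    """
    would_flip = torch.sign(latent_weights - grads) != torch.sign(latent_weights)
    confidence = torch.tanh(strength * latent_weights.abs())          # in [0, 1)
    scale = torch.where(would_flip, 1.0 - 0.5 * confidence, torch.ones_like(grads))
    return grads * scale

w = torch.randn(5)   # latent full-precision weights behind the binary weights
g = torch.randn(5)   # raw gradients from the surrogate-gradient backward pass
print(modulate_gradients(w, g))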
☆ A Collaborative Jade Recognition System for Mobile Devices Based on Lightweight and Large Models
With the widespread adoption and development of mobile devices, vision-based
recognition applications have become a hot topic in research. Jade, as an
important cultural heritage and artistic item, has significant applications in
fields such as jewelry identification and cultural relic preservation. However,
existing jade recognition systems still face challenges in mobile
implementation, such as limited computing resources, real-time requirements,
and accuracy issues. To address these challenges, this paper proposes a jade
recognition system based on size-model collaboration, aiming to achieve
efficient and accurate jade identification on mobile devices such as
smartphones. First, we design a size model based on multi-scale image
processing, extracting key visual information by analyzing jade's dimensions,
shapes, and surface textures. Then, a collaborative multi-model classification
framework is built by combining deep learning and traditional computer vision
algorithms. This framework can effectively select and adjust models based on
different jade characteristics, providing high-accuracy results across various
environments and devices. Experimental results show that the proposed system can
provide high recognition accuracy and fast processing time on mobile devices,
while consuming relatively low computational resources. The system not only
holds great application potential but also provides new ideas and technical
support for the intelligent development of jade identification.
☆ Textured 3D Regenerative Morphing with 3D Diffusion Prior
Textured 3D morphing creates smooth and plausible interpolation sequences
between two 3D objects, focusing on transitions in both shape and texture. This
is important for creative applications like visual effects in filmmaking.
Previous methods rely on establishing point-to-point correspondences and
determining smooth deformation trajectories, which inherently restrict them to
shape-only morphing on untextured, topologically aligned datasets. This
restriction leads to labor-intensive preprocessing and poor generalization. To
overcome these challenges, we propose a method for 3D regenerative morphing
using a 3D diffusion prior. Unlike previous methods that depend on explicit
correspondences and deformations, our method eliminates the additional need for
obtaining correspondence and uses the 3D diffusion prior to generate morphing.
Specifically, we introduce a 3D diffusion model and interpolate the source and
target information at three levels: initial noise, model parameters, and
condition features. We then explore an Attention Fusion strategy to generate
more smooth morphing sequences. To further improve the plausibility of semantic
interpolation and the generated 3D surfaces, we propose two strategies: (a)
Token Reordering, where we match approximate tokens based on semantic analysis
to guide implicit correspondences in the denoising process of the diffusion
model, and (b) Low-Frequency Enhancement, where we enhance low-frequency
signals in the tokens to improve the quality of generated surfaces.
Experimental results show that our method achieves superior smoothness and
plausibility in 3D morphing across diverse cross-category object pairs,
offering a novel regenerative method for 3D morphing with textured
representations.
☆ ODVerse33: Is the New YOLO Version Always Better? A Multi Domain benchmark from YOLO v5 to v11
You Only Look Once (YOLO) models have been widely used for building real-time
object detectors across various domains. With the increasing frequency of new
YOLO versions being released, key questions arise. Are the newer versions
always better than their previous versions? What are the core innovations in
each YOLO version and how do these changes translate into real-world
performance gains? In this paper, we summarize the key innovations from YOLOv1
to YOLOv11, introduce a comprehensive benchmark called ODverse33, which
includes 33 datasets spanning 11 diverse domains (Autonomous driving,
Agricultural, Underwater, Medical, Videogame, Industrial, Aerial, Wildlife,
Retail, Microscopic, and Security), and explore the practical impact of model
improvements in real-world, multi-domain applications through extensive
experimental results. We hope this study can provide some guidance to the
extensive users of object detection models and give some references for future
real-time object detector development.
comment: 18 pages, 4 figures, 7 tables
☆ PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang
In the field of MLLM-based GUI agents, compared to smartphones, the PC
scenario not only features a more complex interactive environment, but also
involves more intricate intra- and inter-app workflows. To address these
issues, we propose a hierarchical agent framework named PC-Agent. Specifically,
from the perception perspective, we devise an Active Perception Module (APM) to
overcome the inadequate abilities of current MLLMs in perceiving screenshot
content. From the decision-making perspective, to handle complex user
instructions and interdependent subtasks more effectively, we propose a
hierarchical multi-agent collaboration architecture that decomposes
decision-making processes into Instruction-Subtask-Action levels. Within this
architecture, three agents (i.e., Manager, Progress and Decision) are set up
for instruction decomposition, progress tracking and step-by-step
decision-making respectively. Additionally, a Reflection agent is adopted to
enable timely bottom-up error feedback and adjustment. We also introduce a new
benchmark PC-Eval with 25 real-world complex instructions. Empirical results on
PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task
success rate over previous state-of-the-art methods. The code will be publicly
available.
comment: 14 pages, 7 figures
☆ OrchardDepth: Precise Metric Depth Estimation of Orchard Scene from Monocular Camera Images
Monocular depth estimation is a fundamental task in robotic perception.
Recently, with the development of more accurate and robust neural network
models and different types of datasets, monocular depth estimation has
significantly improved performance and efficiency. However, most of the
research in this area focuses on very concentrated domains. In particular, most
of the benchmarks in outdoor scenarios belong to urban environments for the
improvement of autonomous driving devices, and these benchmarks have a massive
disparity with the orchard/vineyard environment, which is hardly helpful for
research in the primary industry. Therefore, we propose OrchardDepth, which
fills the gap in the estimation of the metric depth of the monocular camera in
the orchard/vineyard environment. In addition, we present a new retraining
method to improve the training result by monitoring the consistent
regularization between dense depth maps and sparse points. Our method improves
the RMSE of depth estimation in the orchard environment from 1.5337 to 0.6738,
validating the effectiveness of our method.
comment: 10 pages, 5 figures, Australasian Conference on Robotics and
Automation, ACRA, 2024
☆ LLM-EvRep: Learning an LLM-Compatible Event Representation Using a Self-Supervised Framework WWW
Recent advancements in event-based recognition have demonstrated significant
promise, yet most existing approaches rely on extensive training, limiting
their adaptability for efficient processing of event-driven visual content.
Meanwhile, large language models (LLMs) have exhibited remarkable zero-shot
capabilities across diverse domains, but their application to event-based
visual recognition remains largely unexplored. To bridge this gap, we propose
\textbf{LLM-EvGen}, an event representation generator that produces
LLM-compatible event representations \textbf{LLM-EvRep}, thereby enhancing the
performance of LLMs on event recognition tasks. The generator is trained using
a self-supervised framework, aligning the generated representations with
semantic consistency and structural fidelity. Comprehensive experiments were
conducted on three datasets: N-ImageNet, N-Caltech101, and N-MNIST. The results
demonstrate that our method, \textbf{LLM-EvRep}, outperforms the event-to-video
method, E2VID, by 15.93\%, 0.82\%, and 50.21\%, respectively, in recognition
tasks when evaluated using GPT-4o.
comment: 6 pages, 2 figures,Companion Proceedings of the ACM Web Conference
2025 (WWW Companion '25)
☆ Money Recognition for the Visually Impaired: A Case Study on Sri Lankan Banknotes
Currency note recognition is a critical accessibility need for blind
individuals, as identifying banknotes accurately can impact their independence
and security in financial transactions. Several traditional and technological
initiatives have been taken to date. Nevertheless, these approaches are less
user-friendly and have made it more challenging for blind people to identify
banknotes. This research proposes a user-friendly stand-alone system for the
identification of Sri Lankan currency notes. A custom-created dataset of images
of Sri Lankan currency notes was used to fine-tune an EfficientDet model. The
currency note recognition model achieved 0.9847 AP on the validation dataset
and performs exceptionally well in real-world scenarios. The high accuracy and
the intuitive interface have enabled blind individuals to quickly and
accurately identify currency denominations, ultimately encouraging
accessibility and independence.
☆ EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement
Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang
Over the past decade, generative models have achieved significant success in
enhancing fundus images. However, the evaluation of these models still
presents a considerable challenge. A comprehensive evaluation benchmark for
fundus image enhancement is indispensable for three main reasons: 1) The
existing denoising metrics (e.g., PSNR, SSIM) hardly extend to
downstream real-world clinical research (e.g., vessel morphology consistency).
2) There is a lack of comprehensive evaluation for both paired and unpaired
enhancement methods, along with the need for expert protocols to accurately
assess clinical value. 3) An ideal evaluation system should provide insights to
inform future developments of fundus image enhancement. To this end, we propose
a novel comprehensive benchmark, EyeBench, to provide insights that align
enhancement models with clinical needs, offering a foundation for future work
to improve the clinical relevance and applicability of generative models for
fundus image enhancement. EyeBench has three appealing properties: 1)
multi-dimensional clinical alignment downstream evaluation: In addition to
evaluating the enhancement task, we provide several clinically significant
downstream tasks for fundus images, including vessel segmentation, DR grading,
denoising generalization, and lesion segmentation. 2) Medical expert-guided
evaluation design: We introduce a novel dataset that promotes comprehensive and
fair comparisons between paired and unpaired methods and includes a manual
evaluation protocol by medical experts. 3) Valuable insights: Our benchmark
study provides a comprehensive and rigorous evaluation of existing methods
across different downstream tasks, assisting medical experts in making informed
choices. Additionally, we offer further analysis of the challenges faced by
existing methods. The code is available at
\url{https://github.com/Retinal-Research/EyeBench}
☆ Pandora3D: A Comprehensive Framework for High-Quality 3D Shape and Texture Generation
Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Chen, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, Pan Ji
This report presents a comprehensive framework for generating high-quality 3D
shapes and textures from diverse input prompts, including single images,
multi-view images, and text descriptions. The framework consists of 3D shape
generation and texture generation. (1). The 3D shape generation pipeline
employs a Variational Autoencoder (VAE) to encode implicit 3D geometries into a
latent space and a diffusion network to generate latents conditioned on input
prompts, with modifications to enhance model capacity. An alternative
Artist-Created Mesh (AM) generation approach is also explored, yielding
promising results for simpler geometries. (2). Texture generation involves a
multi-stage process starting with frontal image generation, followed by
multi-view image generation, RGB-to-PBR texture conversion, and
high-resolution multi-view texture refinement. A consistency scheduler is
plugged into every stage, to enforce pixel-wise consistency among multi-view
textures during inference, ensuring seamless integration.
The pipeline demonstrates effective handling of diverse input formats,
leveraging advanced neural architectures and novel methodologies to produce
high-quality 3D content. This report details the system architecture,
experimental results, and potential future directions to improve and expand the
framework. The source code and pretrained weights are released at:
\url{https://github.com/Tencent/Tencent-XR-3DGen}.
comment: Tencent XR 3D Gen
☆ OG-Gaussian: Occupancy Based Street Gaussians for Autonomous Driving
Accurate and realistic 3D scene reconstruction enables the lifelike creation
of autonomous driving simulation environments. With advancements in 3D Gaussian
Splatting (3DGS), previous studies have applied it to reconstruct complex
dynamic driving scenes. These methods typically require expensive LiDAR sensors
and pre-annotated datasets of dynamic objects. To address these challenges, we
propose OG-Gaussian, a novel approach that replaces LiDAR point clouds with
Occupancy Grids (OGs) generated from surround-view camera images using
Occupancy Prediction Network (ONet). Our method leverages the semantic
information in OGs to separate dynamic vehicles from static street background,
converting these grids into two distinct sets of initial point clouds for
reconstructing both static and dynamic objects. Additionally, we estimate the
trajectories and poses of dynamic objects through a learning-based approach,
eliminating the need for complex manual annotations. Experiments on the Waymo Open
dataset demonstrate that OG-Gaussian is on par with the current
state-of-the-art in terms of reconstruction quality and rendering speed,
achieving an average PSNR of 35.13 and a rendering speed of 143 FPS, while
significantly reducing computational costs and economic overhead.
☆ Designing Parameter and Compute Efficient Diffusion Transformers using Distillation
Diffusion Transformers (DiTs) with billions of model parameters form the
backbone of popular image and video generation models like DALL.E,
Stable-Diffusion and SORA. Though these models are necessary in many
low-latency applications like Augmented/Virtual Reality, they cannot be
deployed on resource-constrained Edge devices (like Apple Vision Pro or Meta
Ray-Ban glasses) due to their huge computational complexity. To overcome this,
we turn to knowledge distillation and perform a thorough design-space
exploration to achieve the best DiT for a given parameter size. In particular,
we provide principles for how to choose design knobs such as depth, width,
attention heads and distillation setup for a DiT. During the process, a
three-way trade-off emerges between model performance, size and speed that is
crucial for Edge implementation of diffusion. We also propose two distillation
approaches - Teaching Assistant (TA) method and Multi-In-One (MI1) method - to
perform feature distillation in the DiT context. Unlike existing solutions, we
demonstrate and benchmark the efficacy of our approaches on practical Edge
devices such as NVIDIA Jetson Orin Nano.
comment: 4 pages
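A bare-bones sketch of the Teaching Assistant (TA) idea for feature distillation: the intermediate-sized TA is first matched to the teacher, then the student is matched to the TA. The module sizes and the MSE feature loss are placeholder assumptions, not the paper's DiT setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 512), nn.GELU(), nn.Linear(512, 64))
ta      = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
student = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

def feature_distill_loss(big, small, x):
    # Match the smaller model's output features to the (frozen) larger model's.
    with torch.no_grad():
        target = big(x)
    return F.mse_loss(small(x), target)

x = torch.randn(16, 64)
# Stage 1: distill teacher -> TA; Stage 2: distill TA -> student.
loss_ta = feature_distill_loss(teacher, ta, x)
loss_student = feature_distill_loss(ta, student, x)
print(loss_ta.item(), loss_student.item())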
☆ H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging
3D landmark detection is a critical task in medical image analysis, and
accurately detecting anatomical landmarks is essential for subsequent medical
imaging tasks. However, mainstream deep learning methods in this field struggle
to simultaneously capture fine-grained local features and model global spatial
relationships, while maintaining a balance between accuracy and computational
efficiency. Local feature extraction requires capturing fine-grained anatomical
details, while global modeling requires understanding the spatial relationships
within complex anatomical structures. The high-dimensional nature of 3D volume
further exacerbates these challenges, as landmarks are sparsely distributed,
leading to significant computational costs. Therefore, achieving efficient and
precise 3D landmark detection remains a pressing challenge in medical image
analysis.
In this work, we propose a \textbf{H}ybrid \textbf{3}D \textbf{DE}tection
\textbf{Net}(H3DE-Net), a novel framework that combines CNNs for local feature
extraction with a lightweight attention mechanism designed to efficiently
capture global dependencies in 3D volumetric data. This mechanism employs a
hierarchical routing strategy to reduce computational cost while maintaining
global context modeling. To our knowledge, H3DE-Net is the first 3D landmark
detection model that integrates such a lightweight attention mechanism with
CNNs. Additionally, integrating multi-scale feature fusion further enhances
detection accuracy and robustness. Experimental results on a public CT dataset
demonstrate that H3DE-Net achieves state-of-the-art (SOTA) performance,
significantly improving accuracy and robustness, particularly in scenarios with
missing landmarks or complex anatomical variations. We have already open-sourced
our project, including code, data, and model weights.
☆ Asymmetric Co-Training for Source-Free Few-Shot Domain Adaptation
Source-free unsupervised domain adaptation (SFUDA) has gained significant
attention as an alternative to traditional unsupervised domain adaptation
(UDA), which relies on the constant availability of labeled source data.
However, SFUDA approaches come with inherent limitations that are frequently
overlooked. These challenges include performance degradation when the unlabeled
target data fails to meet critical assumptions, such as having a closed-set
label distribution identical to that of the source domain, or when sufficient
unlabeled target data is unavailable-a common situation in real-world
applications. To address these issues, we propose an asymmetric co-training
(ACT) method specifically designed for the source-free few-shot domain
adaptation (SFFSDA) scenario. SFFSDA presents a
more practical alternative to SFUDA, as gathering a few labeled target
instances is more feasible than acquiring large volumes of unlabeled target
data in many real-world contexts. Our ACT method begins by employing a
weak-strong augmentation to enhance data diversity. Then we use a two-step
optimization process to train the target model. In the first step, we optimize
the label smoothing cross-entropy loss, the entropy of the class-conditional
distribution, and the reverse-entropy loss to bolster the model's
discriminative ability while mitigating overfitting. The second step focuses on
reducing redundancy in the output space by minimizing classifier determinacy
disparity. Extensive experiments across four benchmarks demonstrate the
superiority of our ACT approach, which outperforms state-of-the-art SFUDA
methods and transfer learning techniques. Our findings suggest that adapting a
source pre-trained model using only a small amount of labeled target data
offers a practical and dependable solution. The code is available at
https://github.com/gengxuli/ACT.
comment: 13 pages
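An illustrative PyTorch sketch of the three step-1 terms named above, label-smoothing cross-entropy, an entropy term on the class-conditional (batch-mean) distribution, and a reverse cross-entropy loss. Equal weights and the particular signs are assumptions for illustration; the paper defines its own weighting.

import torch
import torch.nn.functional as F

def act_step1_loss(logits, labels, smoothing=0.1, eps=1e-6):
    """Illustrative combination of the three step-1 terms (weights assumed equal)."""
    n_cls = logits.size(1)
    probs = logits.softmax(dim=1)

    # 1) label-smoothing cross-entropy (built into PyTorch)
    ls_ce = F.cross_entropy(logits, labels, label_smoothing=smoothing)

    # 2) entropy of the batch-mean class distribution (diversity regularizer, maximized)
    mean_probs = probs.mean(dim=0)
    marginal_entropy = -(mean_probs * (mean_probs + eps).log()).sum()

    # 3) reverse cross-entropy: swap the roles of predictions and (clipped one-hot) labels
    one_hot = F.one_hot(labels, n_cls).float().clamp(min=eps)
    reverse_ce = -(probs * one_hot.log()).sum(dim=1).mean()

    return ls_ce - marginal_entropy + reverse_ce

logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = act_step1_loss(logits, labels)
loss.backward()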
☆ Spatial and Frequency Domain Adaptive Fusion Network for Image Deblurring
Image deblurring aims to reconstruct a latent sharp image from its
corresponding blurred one. Although existing methods have achieved good
performance, most of them operate exclusively in either the spatial domain or
the frequency domain, rarely exploring solutions that fuse both domains. In
this paper, we propose a spatial-frequency domain adaptive fusion network
(SFAFNet) to address this limitation. Specifically, we design a gated
spatial-frequency domain feature fusion block (GSFFBlock), which consists of
three key components: a spatial domain information module, a frequency domain
information dynamic generation module (FDGM), and a gated fusion module (GFM).
The spatial domain information module employs the NAFBlock to integrate local
information. Meanwhile, in the FDGM, we design a learnable low-pass filter that
dynamically decomposes features into separate frequency subbands, capturing the
image-wide receptive field and enabling the adaptive exploration of global
contextual information. Additionally, to facilitate information flow and the
learning of complementary representations, in the GFM we present a gating
mechanism (GATE) to re-weight spatial- and frequency-domain features, which are
then fused through the cross-attention mechanism (CAM). Experimental results
demonstrate that our SFAFNet performs favorably compared to state-of-the-art
approaches on commonly used benchmarks.
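A toy PyTorch sketch of the gated spatial-frequency fusion idea: a spatial convolution branch, a low-pass decomposition in the Fourier domain, and a gate that re-weights the two branches before fusion. The module layout and the fixed cutoff are assumptions (the paper uses a learnable low-pass filter and its own block design).

import torch
import torch.nn as nn

class GatedSpatialFrequencyFusion(nn.Module):
    """Toy version of a gated spatial/frequency feature fusion block."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)   # local spatial branch
        self.cutoff = 0.25                                           # low-pass cutoff (learnable in the paper; fixed here)
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def low_pass(self, x):
        # Decompose features in the Fourier domain and keep low frequencies only.
        freq = torch.fft.rfft2(x, norm="ortho")
        h, w = freq.shape[-2:]
        fy = torch.fft.fftfreq(x.shape[-2], device=x.device).abs().view(1, 1, h, 1)
        fx = torch.fft.rfftfreq(x.shape[-1], device=x.device).abs().view(1, 1, 1, w)
        mask = ((fy <= self.cutoff) & (fx <= self.cutoff)).float()
        return torch.fft.irfft2(freq * mask, s=x.shape[-2:], norm="ortho")

    def forward(self, x):
        s = self.spatial(x)                 # spatial-domain features
        f = self.low_pass(x)                # frequency-domain (global, low-pass) features
        g = self.gate(torch.cat([s, f], dim=1))
        return self.fuse(torch.cat([g * s, (1 - g) * f], dim=1))

block = GatedSpatialFrequencyFusion(16)
out = block(torch.randn(2, 16, 64, 64))
print(out.shape)   # torch.Size([2, 16, 64, 64])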
☆ Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition
Mobile robots necessitate advanced natural language understanding
capabilities to accurately identify locations and perform tasks such as package
delivery. However, traditional visual place recognition (VPR) methods rely
solely on single-view visual information and cannot interpret human language
descriptions. To overcome this challenge, we bridge text and vision by
proposing a multiview (360{\deg} views of the surroundings) text-vision
registration approach called Text4VPR for place recognition task, which is the
first method that exclusively utilizes textual descriptions to match a database
of images. Text4VPR employs the frozen T5 language model to extract global
textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with
temperature coefficient to assign local tokens to their respective clusters,
thereby aggregating visual descriptors from images. During the training stage,
Text4VPR emphasizes the alignment between individual text-image pairs for
precise textual description. In the inference stage, Text4VPR uses the Cascaded
Cross-Attention Cosine Alignment (CCCA) to address the internal mismatch
between text and image groups. Subsequently, Text4VPR performs precise place
matching based on the descriptions of text-image groups. On Street360Loc, the
first text-to-image VPR dataset, which we created, Text4VPR builds a robust baseline,
achieving a leading top-1 accuracy of 57% and a leading top-10 accuracy of 92%
within a 5-meter radius on the test set, which indicates that localization from
textual descriptions to images is not only feasible but also holds significant
potential for further advancement, as shown in Figure 1.
comment: 8 pages, 4 figures, conference
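A minimal numpy sketch of Sinkhorn normalization with a temperature coefficient for soft-assigning local tokens to clusters, as used in the aggregation step described above; the feature dimensions, temperature, and iteration count are placeholders.

import numpy as np

def sinkhorn_assign(scores, temperature=0.05, iters=5):
    """
    Soft-assign N tokens to K clusters from a similarity matrix `scores` (N x K)
    by alternating row/column normalization of exp(scores / temperature).
    """
    P = np.exp(scores / temperature)
    for _ in range(iters):
        P /= P.sum(axis=1, keepdims=True)    # each token distributes unit mass over clusters
        P /= P.sum(axis=0, keepdims=True)    # balance the total mass assigned to each cluster
    return P / P.sum(axis=1, keepdims=True)  # final row normalization -> per-token assignment

rng = np.random.default_rng(0)
tokens = rng.standard_normal((128, 64))      # local token features
clusters = rng.standard_normal((8, 64))      # cluster centroids
scores = tokens @ clusters.T
scores /= np.linalg.norm(tokens, axis=1, keepdims=True) * np.linalg.norm(clusters, axis=1)
assign = sinkhorn_assign(scores)
print(assign.shape, assign.sum(axis=1)[:3])  # (128, 8), rows sum to 1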
☆ Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
Reward models play an essential role in training vision-language models
(VLMs) by assessing output quality to enable aligning with human preferences.
Despite their importance, the research community lacks comprehensive open
benchmarks for evaluating multimodal reward models in VLMs. To address this
gap, we introduce Multimodal RewardBench, an expert-annotated benchmark
covering six domains: general correctness, preference, knowledge, reasoning,
safety, and visual question-answering. Our dataset comprises 5,211 annotated
(prompt, chosen response, rejected response) triplets collected from various
VLMs. In evaluating a range of VLM judges, we find that even the top-performing
models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall
accuracy. Notably, most models struggle in the reasoning and safety domains.
These findings suggest that Multimodal RewardBench offers a challenging testbed
for advancing reward model development across multiple domains. We release the
benchmark at https://github.com/facebookresearch/multimodal_rewardbench.
comment: Dataset available at
https://github.com/facebookresearch/multimodal_rewardbench
☆ Stereo Image Coding for Machines with Joint Visual Feature Compression
2D image coding for machines (ICM) has achieved great success in coding
efficiency, while less effort has been devoted to stereo image fields. To
promote the efficiency of stereo image compression (SIC) and intelligent
analysis, the stereo image coding for machines (SICM) is formulated and
explored in this paper. More specifically, a machine vision-oriented stereo
feature compression network (MVSFC-Net) is proposed for SICM, where the stereo
visual features are effectively extracted, compressed, and transmitted for 3D
visual tasks. To efficiently compress stereo visual features in MVSFC-Net, a
stereo multi-scale feature compression (SMFC) module is designed to gradually
transform sparse stereo multi-scale features into compact joint visual
representations by removing spatial, inter-view, and cross-scale redundancies
simultaneously. Experimental results show that the proposed MVSFC-Net obtains
superior compression efficiency as well as 3D visual task performance, when
compared with the existing ICM anchors recommended by MPEG and the
state-of-the-art SIC method.
☆ Bayesian SegNet for Semantic Segmentation with Improved Interpretation of Microstructural Evolution During Irradiation of Materials
Understanding the relationship between the evolution of microstructures of
irradiated LiAlO2 pellets and tritium diffusion, retention and release could
improve predictions of tritium-producing burnable absorber rod performance.
Given expert-labeled segmented images of irradiated and unirradiated pellets,
we trained Deep Convolutional Neural Networks to segment images into defect,
grain, and boundary classes. Quantitative microstructural information was
calculated from these segmented images to facilitate the comparison of
unirradiated and irradiated pellets. We tested modifications to improve the
sensitivity of the model, including incorporating meta-data into the model and
utilizing uncertainty quantification. The predicted segmentation was similar to
the expert-labeled segmentation for most measures of microstructural
quantification, including pixel proportion, defect area, and defect density.
Overall, the high performance metrics for the best models on both irradiated
and unirradiated images show that utilizing neural network models is a viable
alternative to expert-labeled images.
☆ NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis ICASSP 2025
Talking head synthesis is to synthesize a lip-synchronized talking head video
using audio. Recently, the capability of NeRF to enhance the realism and
texture details of synthesized talking heads has attracted the attention of
researchers. However, most current NeRF methods based on audio are exclusively
concerned with the rendering of frontal faces. These methods are unable to
generate clear talking heads in novel views. Another prevalent challenge in
current 3D talking head synthesis is the difficulty in aligning acoustic and
visual spaces, which often results in suboptimal lip-syncing of the generated
talking heads. To address these issues, we propose Neural Radiance Field with
3D Prior Aided Audio Disentanglement for Talking Head Synthesis
(NeRF-3DTalker). Specifically, the proposed method employs 3D prior information
to synthesize clear talking heads with free views. Additionally, we propose a
3D Prior Aided Audio Disentanglement module, which is designed to disentangle
the audio into two distinct categories: features related to 3D-aware speech
movements and features related to speaking style. Moreover, to reposition the
generated frames that are distant from the speaker's motion space in the real
space, we have devised a local-global Standardized Space. This method
normalizes the irregular positions in the generated frames from both global and
local semantic perspectives. Through comprehensive qualitative and quantitative
experiments, it has been demonstrated that our NeRF-3DTalker outperforms
state-of-the-art in synthesizing realistic talking head videos, exhibiting
superior image quality and lip synchronization. Project page:
https://nerf-3dtalker.github.io/NeRF-3Dtalker.
comment: Accepted by ICASSP 2025
☆ Deep learning based infrared small object segmentation: Challenges and future directions
Infrared sensing is a core method for supporting unmanned systems, such as
autonomous vehicles and drones. Recently, infrared sensors have been widely
deployed on mobile and stationary platforms for detection and classification of
objects from long distances and in wide fields of view. Given its success in
the vision image analysis domain, deep learning has also been applied for
object recognition in infrared images. However, techniques that have proven
successful in visible light perception face new challenges in the infrared
domain. These challenges include extremely low signal-to-noise ratios in
infrared images, very small and blurred objects of interest, and limited
availability of labeled/unlabeled training data due to the specialized nature
of infrared sensors. Numerous methods have been proposed in the literature for
the detection and classification of small objects in infrared images achieving
varied levels of success. There is a need for a survey paper that critically
analyzes existing techniques in this domain, identifies unsolved challenges and
provides future research directions. This paper fills the gap and offers a
concise and insightful review of deep learning-based methods. It also
identifies the challenges faced by existing infrared object segmentation
methods and provides a structured review of existing infrared perception
methods from the perspective of these challenges and highlights the motivations
behind the various approaches. Finally, this review suggests promising future
directions based on recent advancements within this domain.
comment: This is a submitted version of a paper accepted by Information
Fusion. If you want a better reading experience, please refer to the final
published version of Information Fusion
♻ ☆ VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs
We propose $\textbf{VidStyleODE}$, a spatiotemporally continuous disentangled
$\textbf{Vid}$eo representation based upon $\textbf{Style}$GAN and
Neural-$\textbf{ODE}$s. Effective traversal of the latent space learned by
Generative Adversarial Networks (GANs) has been the basis for recent
breakthroughs in image editing. However, the applicability of such advancements
to the video domain has been hindered by the difficulty of representing and
controlling videos in the latent space of GANs. In particular, videos are
composed of content (i.e., appearance) and complex motion components that
require a special mechanism to disentangle and control. To achieve this,
VidStyleODE encodes the video content in a pre-trained StyleGAN $\mathcal{W}_+$
space and benefits from a latent ODE component to summarize the spatiotemporal
dynamics of the input video. Our novel continuous video generation process then
combines the two to generate high-quality and temporally consistent videos with
varying frame rates. We show that our proposed method enables a variety of
applications on real videos: text-guided appearance manipulation, motion
manipulation, image animation, and video interpolation and extrapolation.
Project website: https://cyberiada.github.io/VidStyleODE
comment: Project website: https://cyberiada.github.io/VidStyleODE
♻ ☆ Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted
their potential in building general-purpose agents in the 3D real world, yet
challenges remain due to the lack of high-quality robust instruction-following
data, leading to limited discriminative power and generalization of 3DLLMs. In
this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale
instruction-following data generated by our novel data engine, Robust
Instruction Generation (RIG) engine. RIG generates two key instruction data: 1)
the Adversarial Instruction-following data, which features mixed negative and
positive samples to enhance the model's discriminative understanding. 2) the
Diverse Instruction-following data, which contains various instruction styles
to enhance the model's generalization. As a result, we construct 1 million
instruction-following data, consisting of 344K Adversarial samples, 508K
Diverse samples, and 165K benchmark training set samples. To better handle
these complex instructions, Robin3D first incorporates Relation-Augmented
Projector to enhance spatial understanding, and then strengthens the object
referring and grounding ability through ID-Feature Bonding. Robin3D
consistently outperforms previous methods across five widely-used 3D multimodal
learning benchmarks, without the need for task-specific fine-tuning. Notably,
we achieve a 7.8\% improvement in the grounding task (Multi3DRefer) and a 6.9\%
improvement in the captioning task (Scan2Cap).
comment: 8 pages
♻ ☆ Data Attribution for Text-to-Image Models by Unlearning Synthesized Images NeurIPS 2024
The goal of data attribution for text-to-image models is to identify the
training images that most influence the generation of a new image. Influence is
defined such that, for a given output, if a model is retrained from scratch
without the most influential images, the model would fail to reproduce the same
output. Unfortunately, directly searching for these influential images is
computationally infeasible, since it would require repeatedly retraining models
from scratch. In our work, we propose an efficient data attribution method by
simulating unlearning the synthesized image. We achieve this by increasing the
training loss on the output image, without catastrophic forgetting of other,
unrelated concepts. We then identify training images with significant loss
deviations after the unlearning process and label these as influential. We
evaluate our method with a computationally intensive but "gold-standard"
retraining from scratch and demonstrate our method's advantages over previous
methods.
comment: NeurIPS 2024 camera ready version. Project page:
https://peterwang512.github.io/AttributeByUnlearning Code:
https://github.com/PeterWang512/AttributeByUnlearning
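As a rough illustration of the unlearning-then-attribution recipe described in the abstract above, the sketch below performs gradient ascent on the training loss of the synthesized image and then ranks training images by how much their loss shifted. It is a simplified toy version under my own assumptions (the `training_loss` callable, learning rate, and step count are placeholders), not the authors' implementation; their mitigation of catastrophic forgetting is omitted here.

import copy
import torch

def attribute_by_unlearning(model, synth_image, prompt, train_set,
                            training_loss, lr=1e-5, steps=50, top_k=100):
    """Toy sketch: 'unlearn' a synthesized image by gradient ascent on its
    training loss, then rank training images by how much their loss changed."""
    unlearned = copy.deepcopy(model)
    opt = torch.optim.SGD(unlearned.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = training_loss(unlearned, synth_image, prompt)
        (-loss).backward()          # ascend: increase loss on the output image
        opt.step()
    deviations = []
    with torch.no_grad():
        for idx, (img, caption) in enumerate(train_set):
            before = training_loss(model, img, caption).item()
            after = training_loss(unlearned, img, caption).item()
            deviations.append((after - before, idx))
    deviations.sort(reverse=True)   # largest loss increase = most influential
    return [idx for _, idx in deviations[:top_k]]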
♻ ☆ Sketch2CAD: 3D CAD Model Reconstruction from 2D Sketch using Visual Transformer
Current 3D reconstruction methods typically generate outputs in the form of
voxels, point clouds, or meshes. However, each of these formats has inherent
limitations, such as rough surfaces and distorted structures. Additionally,
these data types are not ideal for further manual editing and post-processing.
In this paper, we present a novel 3D reconstruction method designed to overcome
these disadvantages by reconstructing CAD-compatible models. We trained a
visual transformer to predict a "scene descriptor" from a single 2D wire-frame
image. This descriptor includes essential information, such as object types and
parameters like position, rotation, and size. Using the predicted parameters, a
3D scene can be reconstructed with 3D modeling software that has programmable
interfaces, such as Rhino Grasshopper, to build highly editable 3D models in
the form of B-rep. To evaluate our proposed model, we created two datasets: one
consisting of simple scenes and another with more complex scenes. The test
results indicate the model's capability to accurately reconstruct simple scenes
while highlighting its difficulties with more complex ones.
♻ ☆ Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It ICLR 2025
Label smoothing (LS) is a popular regularisation method for training neural
networks as it is effective in improving test accuracy and is simple to
implement. ``Hard'' one-hot labels are ``smoothed'' by uniformly distributing
probability mass to other classes, reducing overfitting. Prior work has
suggested that in some cases LS can degrade selective classification (SC) --
where the aim is to reject misclassifications using a model's uncertainty. In
this work, we first demonstrate empirically across an extended range of
large-scale tasks and architectures that LS consistently degrades SC. We then
address a gap in existing knowledge, providing an explanation for this
behaviour by analysing logit-level gradients: LS degrades the uncertainty rank
ordering of correct vs incorrect predictions by suppressing the max logit more
when a prediction is likely to be correct, and less when it is likely to be
wrong. This elucidates previously reported experimental results where strong
classifiers underperform in SC. We then demonstrate the empirical effectiveness
of post-hoc logit normalisation for recovering lost SC performance caused by
LS. Furthermore, linking back to our gradient analysis, we again provide an
explanation for why such normalisation is effective.
comment: Published as a conference paper at ICLR 2025
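The two mechanisms in the abstract above can be written compactly; the snippet below is a schematic illustration rather than the paper's exact recipe: smoothed targets for training, and a post-hoc L2 normalisation of the logits whose maximum softmax probability is then used as the confidence score for selective classification. The epsilon, temperature, and normalisation choices here are assumptions for illustration.

import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the true class, eps spread uniformly."""
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes

def confidence_after_logit_norm(logits, tau=1.0):
    """Post-hoc fix: L2-normalise the logits, then take the max softmax
    probability as the selective-classification confidence score."""
    normed = logits / (logits.norm(p=2, dim=-1, keepdim=True) + 1e-12)
    return F.softmax(normed / tau, dim=-1).max(dim=-1).values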
♻ ☆ YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection
We aim at providing the object detection community with an efficient and
performant object detector, termed YOLO-MS. The core design is based on a
series of investigations on how multi-branch features of the basic block and
convolutions with different kernel sizes affect the detection performance of
objects at different scales. The outcome is a new strategy that can
significantly enhance multi-scale feature representations of real-time object
detectors. To verify the effectiveness of our work, we train our YOLO-MS on the
MS COCO dataset from scratch without relying on any other large-scale datasets,
like ImageNet or pre-trained weights. Without bells and whistles, our YOLO-MS
outperforms the recent state-of-the-art real-time object detectors, including
YOLO-v7, RTMDet, and YOLO-v8. Taking the XS version of YOLO-MS as an example,
it can achieve an AP score of 42+% on MS COCO, which is about 2% higher than
RTMDet with the same model size. Furthermore, our work can also serve as a
plug-and-play module for other YOLO models. For example, our method significantly
advances the APs, APl, and AP of YOLOv8-N from 18%+, 52%+, and 37%+ to 20%+,
55%+, and 40%+, respectively, with even fewer parameters and MACs. Code and
trained models are publicly available at
https://github.com/FishAndWasabi/YOLO-MS. We also provide the Jittor version at
https://github.com/NK-JittorCV/nk-yolo.
comment: 13 pages, 8 figures
♻ ☆ Learned Image Transmission with Hierarchical Variational Autoencoder
In this paper, we introduce an innovative hierarchical joint source-channel
coding (HJSCC) framework for image transmission, utilizing a hierarchical
variational autoencoder (VAE). Our approach leverages a combination of
bottom-up and top-down paths at the transmitter to autoregressively generate
multiple hierarchical representations of the original image. These
representations are then directly mapped to channel symbols for transmission by
the JSCC encoder. We extend this framework to scenarios with a feedback link,
modeling transmission over a noisy channel as a probabilistic sampling process
and deriving a novel generative formulation for JSCC with feedback. Compared
with existing approaches, our proposed HJSCC provides enhanced adaptability by
dynamically adjusting transmission bandwidth, encoding these representations
into varying numbers of channel symbols. Extensive experiments on images of
varying resolutions demonstrate that our proposed model outperforms existing
baselines in rate-distortion performance and maintains robustness against
channel noise. The source code will be made available upon acceptance.
♻ ☆ PFDiff: Training-Free Acceleration of Diffusion Models Combining Past and Future Scores ICLR 2025
Diffusion Probabilistic Models (DPMs) have shown remarkable potential in
image generation, but their sampling efficiency is hindered by the need for
numerous denoising steps. Most existing solutions accelerate the sampling
process by proposing fast ODE solvers. However, the inevitable discretization
errors of the ODE solvers are significantly magnified when the number of
function evaluations (NFE) is small. In this work, we propose PFDiff, a novel
training-free and orthogonal timestep-skipping strategy, which enables existing
fast ODE solvers to operate with fewer NFE. Specifically, PFDiff initially
utilizes score replacement from past time steps to predict a ``springboard''.
Subsequently, it employs this ``springboard'' along with foresight updates
inspired by Nesterov momentum to rapidly update current intermediate states.
This approach effectively reduces unnecessary NFE while correcting for
discretization errors inherent in first-order ODE solvers. Experimental results
demonstrate that PFDiff exhibits flexible applicability across various
pre-trained DPMs, particularly excelling in conditional DPMs and surpassing
previous state-of-the-art training-free methods. For instance, using DDIM as a
baseline, we achieved 16.46 FID (4 NFE) compared to 138.81 FID with DDIM on
ImageNet 64x64 with classifier guidance, and 13.06 FID (10 NFE) on Stable
Diffusion with 7.5 guidance scale. Code is available at
\url{https://github.com/onefly123/PFDiff}.
comment: Accepted at ICLR 2025
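Very roughly, the timestep-skipping idea in the PFDiff abstract above reads like the sketch below: a noise prediction cached from a past step is reused to jump to a "springboard" state, which is then evaluated once, and that foresight score carries the original state across two nominal timesteps. This is my own schematic DDIM-style simplification under assumed interfaces (model(x, t), a tensor of cumulative alphas), not the authors' update rules; see their repository for the exact method.

import torch

def ddim_step(x, eps, a_t, a_s):
    """One deterministic DDIM update from noise level a_t to a_s (alphas_cumprod)."""
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_s.sqrt() * x0 + (1 - a_s).sqrt() * eps

@torch.no_grad()
def pfdiff_like_sampler(model, x, alphas, timesteps):
    """Schematic: reuse the past score as a springboard, so one fresh network
    evaluation advances the state across two nominal timesteps."""
    eps_prev = model(x, timesteps[0])
    for i in range(0, len(timesteps) - 2, 2):   # tail steps omitted for brevity
        t, t_mid, t_next = timesteps[i], timesteps[i + 1], timesteps[i + 2]
        # springboard: jump to t_mid with the cached score, no new model call
        x_spring = ddim_step(x, eps_prev, alphas[t], alphas[t_mid])
        # single fresh evaluation at the springboard ("foresight" score)
        eps_fore = model(x_spring, t_mid)
        # use the foresight score to update the original state across both steps
        x = ddim_step(x, eps_fore, alphas[t], alphas[t_next])
        eps_prev = eps_fore
    return x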
♻ ☆ Text-to-Image Rectified Flow as Plug-and-Play Priors ICLR 2025
Large-scale diffusion models have achieved remarkable performance in
generative tasks. Beyond their initial training applications, these models have
proven their ability to function as versatile plug-and-play priors. For
instance, 2D diffusion models can serve as loss functions to optimize 3D
implicit models. Rectified flow, a novel class of generative models, enforces a
linear progression from the source to the target distribution and has
demonstrated superior performance across various domains. Compared to
diffusion-based methods, rectified flow approaches surpass them in terms of
generation quality and efficiency, requiring fewer inference steps. In this
work, we present theoretical and experimental evidence demonstrating that
rectified flow based methods offer similar functionalities to diffusion models
- they can also serve as effective priors. Besides the generative capabilities
of diffusion priors, motivated by the unique time-symmetry properties of
rectified flow models, a variant of our method can additionally perform image
inversion. Experimentally, our rectified flow-based priors outperform their
diffusion counterparts - the SDS and VSD losses - in text-to-3D generation. Our
method also displays competitive performance in image inversion and editing.
comment: ICLR 2025 Camera Ready. Code:
https://github.com/yangxiaofeng/rectified_flow_prior
♻ ☆ Robust Tumor Segmentation with Hyperspectral Imaging and Graph Neural Networks
Mayar Lotfy Mostafa, Anna Alperovich, Tommaso Giannantonio, Bjorn Barz, Xiaohan Zhang, Felix Holm, Nassir Navab, Felix Boehm, Carolin Schwamborn, Thomas K. Hoffmann, Patrick J. Schuler
Segmenting the boundary between tumor and healthy tissue during surgical
cancer resection poses a significant challenge. In recent years, Hyperspectral
Imaging (HSI) combined with Machine Learning (ML) has emerged as a promising
solution. However, due to the extensive information contained within the
spectral domain, most ML approaches primarily classify individual HSI
(super-)pixels, or tiles, without taking into account their spatial context. In
this paper, we propose an improved methodology that leverages the spatial
context of tiles for more robust and smoother segmentation. To address the
irregular shapes of tiles, we utilize Graph Neural Networks (GNNs) to propagate
context information across neighboring regions. The features for each tile
within the graph are extracted using a Convolutional Neural Network (CNN),
which is trained simultaneously with the subsequent GNN. Moreover, we
incorporate local image quality metrics into the loss function to enhance the
training procedure's robustness against low-quality regions in the training
images. We demonstrate the superiority of our proposed method using a clinical
ex vivo dataset consisting of 51 HSI images from 30 patients. Despite the
limited dataset, the GNN-based model significantly outperforms context-agnostic
approaches, accurately distinguishing between healthy and tumor tissues, even
in images from previously unseen patients. Furthermore, we show that our
carefully designed loss function, accounting for local image quality, results
in additional improvements. Our findings demonstrate that context-aware GNN
algorithms can robustly find tumor demarcations on HSI images, ultimately
contributing to better surgery success and patient outcome.
comment: 18 pages, 5 figures, The German Conference on Pattern Recognition
(GCPR) 2024
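As a compact illustration of the tile-graph idea in the abstract above (per-tile features from a small CNN, then graph layers that propagate context between neighbouring tiles), here is a generic CNN+GNN sketch written for illustration only; the authors' architecture, feature extractor, and quality-aware loss differ.

import torch
import torch.nn as nn

class TileGNN(nn.Module):
    """Per-tile CNN encoder followed by simple neighbourhood-averaging GNN layers."""
    def __init__(self, in_ch, feat_dim=64, num_classes=2, gnn_layers=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU())
        self.gnn = nn.ModuleList(
            [nn.Linear(2 * feat_dim, feat_dim) for _ in range(gnn_layers)])
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, tiles, adj):
        # tiles: (N, C, H, W) HSI tiles; adj: (N, N) 0/1 neighbourhood matrix
        h = self.cnn(tiles)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        for layer in self.gnn:
            neigh = adj @ h / deg                    # mean over neighbouring tiles
            h = torch.relu(layer(torch.cat([h, neigh], dim=-1)))
        return self.head(h)                          # per-tile tumor/healthy logits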
♻ ☆ CaRtGS: Computational Alignment for Real-Time Gaussian Splatting SLAM
Simultaneous Localization and Mapping (SLAM) is pivotal in robotics, with
photorealistic scene reconstruction emerging as a key challenge. To address
this, we introduce Computational Alignment for Real-Time Gaussian Splatting
SLAM (CaRtGS), a novel method enhancing the efficiency and quality of
photorealistic scene reconstruction in real-time environments. Leveraging 3D
Gaussian Splatting (3DGS), CaRtGS achieves superior rendering quality and
processing speed, which is crucial for scene photorealistic reconstruction. Our
approach tackles computational misalignment in Gaussian Splatting SLAM
(GS-SLAM) through an adaptive strategy that enhances optimization iterations,
addresses long-tail optimization, and refines densification. Experiments on
Replica, TUM-RGBD, and VECtor datasets demonstrate CaRtGS's effectiveness in
achieving high-fidelity rendering with fewer Gaussian primitives. This work
propels SLAM towards real-time, photorealistic dense rendering, significantly
advancing photorealistic scene representation. For the benefit of the research
community, we release the code and accompanying videos on our project website:
https://dapengfeng.github.io/cartgs.
comment: Accepted by IEEE Robotics and Automation Letters (RA-L)
♻ ☆ RhythmFormer: Extracting Patterned rPPG Signals based on Periodic Sparse Attention
Remote photoplethysmography (rPPG) is a non-contact method for detecting
physiological signals based on facial videos, holding high potential in various
applications. Due to the periodic nature of rPPG signals, the long-range
dependency-capturing capacity of the transformer was assumed to be advantageous
for such signals. However, existing methods have not conclusively demonstrated
the superior performance of transformers over traditional convolutional neural
networks. This may be attributed to the quadratic scaling of the transformer
with sequence length, resulting in coarse-grained feature extraction, which in
turn affects robustness and generalization. To address
that, this paper proposes a periodic sparse attention mechanism based on
temporal attention sparsity induced by periodicity. A pre-attention stage is
introduced before the conventional attention mechanism. This stage learns
periodic patterns to filter out a large number of irrelevant attention
computations, thus enabling fine-grained feature extraction. Moreover, to
address the issue of fine-grained features being more susceptible to noise
interference, a fusion stem is proposed to effectively guide self-attention
towards rPPG features. It can be easily integrated into existing methods to
enhance their performance. Extensive experiments show that the proposed method
achieves state-of-the-art performance in both intra-dataset and cross-dataset
evaluations. The codes are available at
https://github.com/zizheng-guo/RhythmFormer.
♻ ☆ An Open-Source Tool for Mapping War Destruction at Scale in Ukraine using Sentinel-1 Time Series
Olivier Dietrich, Torben Peters, Vivien Sainte Fare Garnot, Valerie Sticher, Thao Ton-That Whelan, Konrad Schindler, Jan Dirk Wegner
Access to detailed war impact assessments is crucial for humanitarian
organizations to assist affected populations effectively. However, maintaining
a comprehensive understanding of the situation on the ground is challenging,
especially in widespread and prolonged conflicts. Here we present a scalable
method for estimating building damage resulting from armed conflicts. By
training a machine learning model on Synthetic Aperture Radar image time
series, we generate probabilistic damage estimates at the building level,
leveraging existing damage assessments and open building footprints. To allow
large-scale inference and ensure accessibility, we tie our method to run on
Google Earth Engine. Users can adjust confidence intervals to suit their needs,
enabling rapid and flexible assessments of war-related damage across large
areas. We provide two publicly accessible dashboards: a Ukraine Damage Explorer
to dynamically view our precomputed estimates, and a Rapid Damage Mapping Tool
to run our method and generate custom maps.
♻ ☆ Robust Feature Engineering Techniques for Designing Efficient Motor Imagery-Based BCI-Systems
A multitude of individuals across the globe grapple with motor disabilities.
Neural prosthetics utilizing Brain-Computer Interface (BCI) technology exhibit
promise for improving motor rehabilitation outcomes. The intricate nature of
EEG data poses a significant hurdle for current BCI systems. Recently, a
qualitative repository of EEG signals tied to both upper and lower limb
execution of motor and motor imagery tasks has been unveiled. Despite this, the
productivity of the Machine Learning (ML) Models that were trained on this
dataset was alarmingly deficient, and the evaluation framework seemed
insufficient. To enhance outcomes, robust feature engineering (signal
processing) methodologies are implemented. A collection of time domain,
frequency domain, and wavelet-derived features was obtained from 16-channel EEG
signals, and the Maximum Relevance Minimum Redundancy (MRMR) approach was
employed to identify the four most significant features. For classification K
Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), and
Na\"ive Bayes (NB) models were implemented with these selected features,
evaluating their effectiveness through metrics such as testing accuracy,
precision, recall, and F1 Score. By leveraging SVM with a Gaussian Kernel, a
remarkable maximum testing accuracy of 92.50% for motor activities and 95.48%
for imagery activities is achieved. These results are notably more dependable
than those of the previous study, where the peak accuracy was recorded at
74.36%. This work provides an in-depth analysis of the MI Limb EEG dataset and
will help in designing and developing simple, cost-effective, and reliable BCI
systems for neuro-rehabilitation.
comment: 26 pages
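As a rough illustration of the pipeline described above (hand-crafted time/frequency features, selection of the four most relevant features, then an RBF-kernel SVM), here is a scikit-learn sketch; the actual feature set, MRMR implementation, channels, and hyperparameters in the paper differ, and SelectKBest with mutual information is used only as a stand-in for MRMR.

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def simple_features(epochs):
    """Toy features per 16-channel EEG epoch: mean, variance, and average
    FFT power per channel. epochs: (n_epochs, 16, n_samples)."""
    mean = epochs.mean(axis=-1)
    var = epochs.var(axis=-1)
    power = (np.abs(np.fft.rfft(epochs, axis=-1)) ** 2).mean(axis=-1)
    return np.concatenate([mean, var, power], axis=1)   # (n_epochs, 48)

def train_svm(X_raw, y):
    """X_raw: raw EEG epochs, y: motor / imagery class labels."""
    X = simple_features(X_raw)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y)
    clf = make_pipeline(StandardScaler(),
                        SelectKBest(mutual_info_classif, k=4),  # stand-in for MRMR
                        SVC(kernel="rbf"))                      # Gaussian kernel
    clf.fit(Xtr, ytr)
    return clf, clf.score(Xte, yte)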
♻ ☆ UAVDB: Trajectory-Guided Adaptable Bounding Boxes for UAV Detection
The widespread deployment of Unmanned Aerial Vehicles (UAVs) in surveillance,
security, and airspace management has created an urgent demand for precise,
scalable, and efficient UAV detection. However, existing datasets often suffer
from limited scale diversity and inaccurate annotations, hindering robust model
development. This paper introduces UAVDB, a high-resolution UAV detection
dataset constructed using Patch Intensity Convergence (PIC). This novel
technique automatically generates high-fidelity bounding box annotations from
UAV trajectory data~\cite{li2020reconstruction}, eliminating the need for
manual labeling. UAVDB features single-class annotations with a fixed-camera
setup and consists of RGB frames capturing UAVs across various scales, from
large-scale UAVs to near-single-pixel representations, along with challenging
backgrounds that pose difficulties for modern detectors. We first validate the
accuracy and efficiency of PIC-generated bounding boxes by comparing
Intersection over Union (IoU) performance and runtime against alternative
annotation methods, demonstrating that PIC achieves higher annotation accuracy
while being more efficient. Subsequently, we benchmark UAVDB using
state-of-the-art (SOTA) YOLO-series detectors, establishing UAVDB as a valuable
resource for advancing long-range and high-resolution UAV detection.
comment: 9 pages, 5 figures, 4 tables
♻ ☆ DaBiT: Depth and Blur informed Transformer for Video Focal Deblurring
In many real-world scenarios, recorded videos suffer from accidental focus
blur, and while video deblurring methods exist, most specifically target motion
blur or spatial-invariant blur. This paper introduces a framework optimized for
the as yet unattempted task of video focal deblurring (refocusing). The
proposed method employs novel map-guided transformers, in addition to image
propagation, to effectively leverage the continuous spatial variance of focal
blur and restore the footage. We also introduce a flow re-focusing module
designed to efficiently align relevant features between blurry and sharp
domains. Additionally, we propose a novel technique for generating synthetic
focal blur data, broadening the model's learning capabilities and robustness to
include a wider array of content. We have made a new benchmark dataset,
DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS
video segmentation set, provides realistic focal blur degradations as well as
the corresponding blur maps. Comprehensive experiments demonstrate the
superiority of our approach. We achieve state-of-the-art results with an
average PSNR performance over 1.9dB greater than comparable existing video
restoration methods. Our source code and the developed databases will be made
available at https://github.com/crispianm/DaBiT
♻ ☆ DSCA: A Digital Subtraction Angiography Sequence Dataset and Spatio-Temporal Model for Cerebral Artery Segmentation
Jiong Zhang, Qihang Xie, Lei Mou, Dan Zhang, Da Chen, Caifeng Shan, Yitian Zhao, Ruisheng Su, Mengguo Guo
Cerebrovascular diseases (CVDs) remain a leading cause of global disability
and mortality. Digital Subtraction Angiography (DSA) sequences, recognized as
the gold standard for diagnosing CVDs, can clearly visualize the dynamic flow
and reveal pathological conditions within the cerebrovasculature. Therefore,
precise segmentation of cerebral arteries (CAs) and classification between
their main trunks and branches are crucial for physicians to accurately
quantify diseases. However, achieving accurate CA segmentation in DSA sequences
remains a challenging task due to small vessels with low contrast, and
ambiguity between vessels and residual skull structures. Moreover, the lack of
publicly available datasets limits exploration in the field. In this paper, we
introduce a DSA Sequence-based Cerebral Artery segmentation dataset (DSCA), a
publicly accessible dataset designed specifically for pixel-level semantic
segmentation of CAs. Additionally, we propose DSANet, a spatio-temporal network
for CA segmentation in DSA sequences. Unlike existing DSA segmentation methods
that focus only on a single frame, the proposed DSANet introduces a separate
temporal encoding branch to capture dynamic vessel details across multiple
frames. To enhance small vessel segmentation and improve vessel connectivity,
we design a novel TemporalFormer module to capture global context and
correlations among sequential frames. Furthermore, we develop a Spatio-Temporal
Fusion (STF) module to effectively integrate spatial and temporal features from
the encoder. Extensive experiments demonstrate that DSANet outperforms other
state-of-the-art methods in CA segmentation, achieving a Dice of 0.9033.
comment: Published by TMI
♻ ☆ Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection
3D visual grounding (3DVG) is challenging because of the requirement of
understanding on visual information, language and spatial relationships. While
supervised approaches have achieved superior performance, they are constrained
by the scarcity and high cost of 3D vision-language datasets. On the other
hand, LLM/VLM based agents are proposed for 3DVG, eliminating the need for
training data. However, these methods incur prohibitive time and token costs
during inference. To address the challenges, we introduce a novel training-free
symbolic framework for 3D visual grounding, namely Evolvable Symbolic Visual
Grounder (EaSe), that offers significantly reduced inference costs compared to
previous agent-based methods while maintaining comparable performance. EaSe
uses LLM-generated code to compute spatial relationships. EaSe also
implements an automatic pipeline to evaluate and optimize the quality of this
code and integrates VLMs to assist in the grounding process. Experimental
results demonstrate that EaSe achieves 52.9% accuracy on the Nr3D dataset and 49.2%
Acc@0.25 on ScanRefer, which is top-tier among training-free methods. Moreover,
it substantially reduces the inference time and cost, offering a balanced
trade-off between performance and efficiency. Codes are available at
https://github.com/OpenRobotLab/EaSe.
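The kind of LLM-generated code EaSe executes is essentially small geometric predicates over object positions; a hypothetical example (function names and thresholds are mine, purely for illustration of the idea) might look like the following.

import numpy as np

def is_left_of(anchor_center, target_center, viewpoint=np.zeros(3)):
    """True if the target lies to the left of the anchor as seen from the
    viewpoint (a simple cross-product test in the horizontal plane)."""
    forward = anchor_center[:2] - viewpoint[:2]
    offset = target_center[:2] - anchor_center[:2]
    cross = forward[0] * offset[1] - forward[1] * offset[0]
    return cross > 0

def is_near(center_a, center_b, threshold=1.0):
    """True if two object centers are within `threshold` meters of each other."""
    return float(np.linalg.norm(center_a - center_b)) < threshold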
♻ ☆ Texture and Noise Dual Adaptation for Infrared Image Super-Resolution
Recent efforts have explored leveraging visible light images to enrich
texture details in infrared (IR) super-resolution. However, this direct
adaptation approach often becomes a double-edged sword, as it improves texture
at the cost of introducing noise and blurring artifacts. To address these
challenges, we propose the Target-oriented Domain Adaptation SRGAN (DASRGAN),
an innovative framework specifically engineered for robust IR super-resolution
model adaptation. DASRGAN operates on the synergy of two key components: 1)
Texture-Oriented Adaptation (TOA) to refine texture details meticulously, and
2) Noise-Oriented Adaptation (NOA), dedicated to minimizing noise transfer.
Specifically, TOA uniquely integrates a specialized discriminator,
incorporating a prior extraction branch, and employs a Sobel-guided adversarial
loss to align texture distributions effectively. Concurrently, NOA utilizes a
noise adversarial loss to distinctly separate the generative and Gaussian noise
pattern distributions during adversarial training. Our extensive experiments
confirm DASRGAN's superiority. Comparative analyses against leading methods
across multiple benchmarks and upsampling factors reveal that DASRGAN sets new
state-of-the-art performance standards. Code is available at
\url{https://github.com/yongsongH/DASRGAN}.
comment: Accepted by Pattern Recognition
♻ ☆ Infrared Image Super-Resolution: Systematic Review, and Future Trends
Image Super-Resolution (SR) is essential for a wide range of computer vision
and image processing tasks. Investigating infrared (IR) image (or thermal
images) super-resolution is a continuing concern within the development of deep
learning. This survey aims to provide a comprehensive perspective of IR image
super-resolution, including its applications, hardware imaging system dilemmas,
and taxonomy of image processing methodologies. In addition, the datasets and
evaluation metrics in IR image super-resolution tasks are also discussed.
Furthermore, the deficiencies in current technologies and possible promising
directions for the community to explore are highlighted. To cope with the rapid
development in this field, we intend to regularly update the relevant excellent
work at \url{https://github.com/yongsongH/Infrared_Image_SR_Survey}.
comment: This work has been submitted to the Pattern Recognition for possible
publication
♻ ☆ Infrared Small Target Detection in Satellite Videos: A New Dataset and A Novel Recurrent Feature Refinement Framework
Xinyi Ying, Li Liu, Zaipin Lin, Yangsi Shi, Yingqian Wang, Ruojing Li, Xu Cao, Boyang Li, Shilin Zhou, Wei An
Multi-frame infrared small target (MIRST) detection in satellite videos is a
long-standing, fundamental yet challenging task, and the challenges can be
summarized as follows. First, the extremely small target size, highly complex
clutter and noise, and varied satellite motion result in limited feature
representation, high false-alarm rates, and difficult motion analysis. Second,
the lack of large-scale publicly available MIRST datasets in satellite videos
greatly hinders algorithm development. To address these challenges, in
this paper, we first build a large-scale dataset for MIRST detection in
satellite videos (namely IRSatVideo-LEO), and then develop a recurrent feature
refinement (RFR) framework as the baseline method. Specifically, IRSatVideo-LEO
is a semi-simulated dataset with synthesized satellite motion, target
appearance, trajectory and intensity, which can provide a standard toolbox for
satellite video generation and a reliable evaluation platform to facilitate the
algorithm development. As the baseline method, RFR is designed to equip
existing powerful CNN-based methods with long-term temporal dependency
exploitation and integrated motion compensation and MIRST detection.
Specifically, a pyramid deformable alignment (PDA) module and a
temporal-spatial-frequency modulation (TSFM) module are proposed to achieve
effective and efficient feature alignment, propagation, aggregation and
refinement. Extensive experiments have been conducted to demonstrate the
effectiveness and superiority of our scheme. The comparative results show that
ResUNet equipped with RFR outperforms the state-of-the-art MIRST detection
methods. Dataset and code are released at https://github.com/XinyiYing/RFR.
♻ ☆ MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to
evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA
includes 4,460 questions spanning 17 specialties and 11 body systems. It
includes two subsets, Text for text evaluation and MM for multimodal
evaluation. Notably, MM introduces expert-level exam questions with diverse
images and rich clinical information, including patient records and examination
results, setting it apart from traditional medical multimodal benchmarks with
simple QA pairs generated from image captions. MedXpertQA applies rigorous
filtering and augmentation to address the insufficient difficulty of existing
benchmarks like MedQA, and incorporates specialty board questions to improve
clinical relevance and comprehensiveness. We perform data synthesis to mitigate
data leakage risk and conduct multiple rounds of expert reviews to ensure
accuracy and reliability. We evaluate 16 leading models on MedXpertQA.
Moreover, medicine is deeply connected to real-world decision-making, providing
a rich and representative setting for assessing reasoning abilities beyond
mathematics and code. To this end, we develop a reasoning-oriented subset to
facilitate the assessment of o1-like models.
♻ ☆ Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines
Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Xu Cao, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li Liu
Small object detection (SOD) has been a longstanding yet challenging task for
decades, with numerous datasets and algorithms being developed. However, they
mainly focus on either visible or thermal modality, while visible-thermal
(RGBT) bimodality is rarely explored. Although some RGBT datasets have been
developed recently, the insufficient quantity, limited category, misaligned
images and large target size cannot provide an impartial benchmark to evaluate
multi-category visible-thermal small object detection (RGBT SOD) algorithms. In
this paper, we build the first large-scale benchmark with high diversity for
RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and
1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and
high-diversity scenes (8 types that cover different illumination and density
variations). Note that, over 81% of targets are smaller than 16x16, and we
provide paired bounding box annotations with tracking ID to offer an extremely
challenging benchmark with wide-range applications, such as RGBT fusion,
detection and tracking. In addition, we propose a scale adaptive fitness
(SAFit) measure that exhibits high robustness on both small and large targets.
The proposed SAFit can provide reasonable performance evaluation and promote
detection performance. Based on the proposed RGBT-Tiny dataset and SAFit
measure, extensive evaluations have been conducted, including 23 recent
state-of-the-art algorithms that cover four different types (i.e., visible
generic detection, visible SOD, thermal SOD and RGBT object detection). Project
is available at https://github.com/XinyiYing/RGBT-Tiny.
♻ ☆ MambaPlace:Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms
Vision Language Place Recognition (VLVPR) enhances robot localization
performance by incorporating natural language descriptions from images. By
utilizing language information, VLVPR directs robot place matching, overcoming
the constraint of solely depending on vision. The essence of multimodal fusion
lies in mining the complementary information between different modalities.
However, general fusion methods rely on traditional neural architectures and
are not well equipped to capture the dynamics of cross-modal interactions,
especially in the presence of complex intra-modal and inter-modal correlations.
To this end, this paper proposes a novel coarse-to-fine, end-to-end cross-modal
place recognition framework, called MambaPlace. In the
coarse localization stage, the text description and 3D point cloud are encoded
by the pretrained T5 and instance encoder, respectively. They are then
processed using Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for
data enhancement and alignment. In the subsequent fine localization stage, the
features of the text description and 3D point cloud are cross-modally fused and
further enhanced through cascaded Cross Attention Mamba (CCAM). Finally, we
predict the positional offset from the fused text point cloud features,
achieving the most accurate localization. Extensive experiments show that
MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset
compared to state-of-the-art methods.
comment: 8 pages
♻ ☆ Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
The Theory of Multiple Intelligences underscores the hierarchical nature of
cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer
a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual
Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial
Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13
mainstream VLMs through nine validated psychometric experiments reveals
significant gaps versus humans (average score 24.95 vs. 68.38), with three key
findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation,
weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller
models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading
(30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought
(0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from
architectural constraints. Identified barriers include weak geometry encoding
and missing dynamic simulation. By linking psychometric BSAs to VLM
capabilities, we provide a diagnostic toolkit for spatial intelligence
evaluation, methodological foundations for embodied AI development, and a
cognitive science-informed roadmap for achieving human-like spatial
intelligence.
♻ ☆ Surface Vision Mamba: Leveraging Bidirectional State Space Model for Efficient Spherical Manifold Representation
Attention-based methods have demonstrated exceptional performance in
modelling long-range dependencies on spherical cortical surfaces, surpassing
traditional Geometric Deep Learning (GDL) models. However, their extensive
inference time and high memory demands pose challenges for application to large
datasets with limited computing resources. Inspired by the state space model in
computer vision, we introduce the attention-free Vision Mamba (Vim) to
spherical surfaces, presenting a domain-agnostic architecture for analyzing
data on spherical manifolds. Our method achieves surface patching by
representing spherical data as a sequence of triangular patches derived from a
subdivided icosphere. The proposed Surface Vision Mamba (SiM) is evaluated on
multiple neurodevelopmental phenotype regression tasks using cortical surface
metrics from neonatal brains. Experimental results demonstrate that SiM
outperforms both attention- and GDL-based methods, delivering 4.8 times faster
inference and achieving 91.7% lower memory consumption compared to the Surface
Vision Transformer (SiT) under the Ico-4 grid partitioning. Sensitivity
analysis further underscores the potential of SiM to identify subtle cognitive
developmental patterns. The code is available at
https://github.com/Rongzhao-He/surface-vision-mamba.
♻ ☆ AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ICLR 2025
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
Video detailed captioning is a key task which aims to generate comprehensive
and coherent textual descriptions of video content, benefiting both video
understanding and generation. In this paper, we propose AuroraCap, a video
captioner based on a large multimodal model. We follow the simplest
architecture design without additional parameters for temporal modeling. To
address the overhead caused by lengthy video sequences, we implement the token
merging strategy, reducing the number of input visual tokens. Surprisingly, we
found that this strategy results in little performance loss. AuroraCap shows
superior performance on various video and image captioning benchmarks, for
example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and
Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include
simple descriptions, consisting of a few dozen words, which limits research in
this field. Therefore, we develop VDC, a video detailed captioning benchmark
with over one thousand carefully annotated structured captions. In addition, we
propose a new LLM-assisted metric VDCscore for bettering evaluation, which
adopts a divide-and-conquer strategy to transform long caption evaluation into
multiple short question-answer pairs. With the help of human Elo ranking, our
experiments show that this benchmark better correlates with human judgments of
video detailed captioning quality.
comment: Accepted to ICLR 2025. Code, docs, weights, benchmark, and training
data are all available at https://rese1f.github.io/aurora-web/
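The token-merging strategy mentioned in the AuroraCap abstract above (reducing visual tokens by averaging the most similar ones, in the spirit of ToMe-style bipartite matching) can be sketched as below; this is a simplified greedy version written for illustration, not the exact strategy used in the paper, and duplicate merge targets are handled naively.

import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Greedy bipartite token merging: average the r most similar (even, odd)
    token pairs by cosine similarity. x: (B, N, C) visual tokens -> (B, N - r, C)."""
    a, b = x[:, ::2], x[:, 1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    score, match = sim.max(dim=-1)               # best partner in b for each a-token
    order = score.argsort(dim=-1, descending=True)
    merged_idx, kept_idx = order[:, :r], order[:, r:]
    out_b = b.clone()
    for batch in range(x.shape[0]):              # simple non-vectorized merge
        src = a[batch, merged_idx[batch]]
        dst = match[batch, merged_idx[batch]]
        out_b[batch, dst] = (out_b[batch, dst] + src) / 2   # last write wins on ties
    a_kept = torch.gather(
        a, 1, kept_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
    return torch.cat([a_kept, out_b], dim=1)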
♻ ☆ Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
Recent research has shown that CLIP models struggle with visual reasoning
tasks that require grounding compositionality, understanding spatial
relationships, or capturing fine-grained details. One natural hypothesis is
that the CLIP vision encoder does not embed essential information for these
tasks. However, we find that this is not always the case: The encoder gathers
query-relevant visual information, while CLIP fails to extract it. In
particular, we show that another branch of Vision-Language Models (VLMs),
Generative Multimodal Large Language Models (MLLMs), achieve significantly
higher accuracy than CLIP in many of these tasks using the same vision encoder
and weights, indicating that these Generative MLLMs perceive more -- as they
extract and utilize visual information more effectively. We conduct a series of
controlled experiments and reveal that their success is attributed to multiple
key design choices, including patch tokens, position embeddings, and
prompt-based weighting. On the other hand, enhancing the training data alone or
applying a stronger text encoder does not suffice to solve the task, and
additional text tokens offer little benefit. Interestingly, we find that
fine-grained visual reasoning is not exclusive to generative models trained by
an autoregressive loss: When converted into CLIP-like encoders by contrastive
finetuning, these MLLMs still outperform CLIP under the same cosine
similarity-based evaluation protocol. Our study highlights the importance of
VLM architectural choices and suggests directions for improving the performance
of CLIP-like contrastive VLMs.
comment: 17 pages, 3 figures
♻ ☆ Efficient 3D Perception on Multi-Sweep Point Cloud with Gumbel Spatial Pruning
This paper studies point cloud perception within outdoor environments.
Existing methods face limitations in recognizing objects located at a distance
or occluded, due to the sparse nature of outdoor point clouds. In this work, we
observe a significant mitigation of this problem by accumulating multiple
temporally consecutive point cloud sweeps, resulting in a remarkable
improvement in perception accuracy. However, the computation cost also
increases, hindering previous approaches from utilizing a large number of point
cloud sweeps. To tackle this challenge, we find that a considerable portion of
points in the accumulated point cloud is redundant, and discarding these points
has minimal impact on perception accuracy. We introduce a simple yet effective
Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on a
learned end-to-end sampling. The GSP layer is decoupled from other network
components and thus can be seamlessly integrated into existing point cloud
network architectures. Without incurring additional computational overhead, we
increase the number of point cloud sweeps from 10, a common practice, to as
many as 40. Consequently, there is a significant enhancement in perception
performance. For instance, in nuScenes 3D object detection and BEV map
segmentation tasks, our pruning strategy improves several 3D perception
baseline methods.
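The pruning layer described above essentially learns a per-point keep/drop gate trained end-to-end; a minimal sketch using the straight-through Gumbel-softmax trick follows. It is my own simplification of the idea (scorer, temperature, and masking behaviour are assumptions), not the GSP layer's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelPruning(nn.Module):
    """Learned per-point keep/drop gate via straight-through Gumbel-softmax."""
    def __init__(self, feat_dim, tau=1.0):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 2)   # logits for (drop, keep)
        self.tau = tau

    def forward(self, points, feats):
        # points: (N, 3+), feats: (N, feat_dim)
        logits = self.scorer(feats)
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)[:, 1]  # 0/1 keep
        if self.training:
            # keep gradients flowing to pruned points through the soft relaxation
            return points * gate.unsqueeze(-1), feats * gate.unsqueeze(-1)
        keep = gate.bool()
        return points[keep], feats[keep]        # actually discard points at inference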
♻ ☆ DeepFracture: A Generative Approach for Predicting Brittle Fractures with Neural Discrete Representation Learning
In the field of brittle fracture animation, generating realistic destruction
animations using physics-based simulation methods is computationally expensive.
While techniques based on Voronoi diagrams or pre-fractured patterns are
effective for real-time applications, they fail to incorporate collision
conditions when determining fractured shapes during runtime. This paper
introduces a novel learning-based approach for predicting fractured shapes
based on collision dynamics at runtime. Our approach seamlessly integrates
realistic brittle fracture animations with rigid body simulations, utilising
boundary element method (BEM) brittle fracture simulations to generate training
data. To integrate collision scenarios and fractured shapes into a deep
learning framework, we introduce generative geometric segmentation, distinct
from both instance and semantic segmentation, to represent 3D fragment shapes.
We propose an eight-dimensional latent code to address the challenge of
optimising multiple discrete fracture pattern targets that share similar
continuous collision latent codes. This code follows a discrete normal
distribution corresponding to a specific fracture pattern within our latent
impulse representation design. This adaptation enables the prediction of
fractured shapes using neural discrete representation learning. Our
experimental results show that our approach generates considerably more
detailed brittle fractures than existing techniques, while the computational
time is typically reduced compared to traditional simulation methods at
comparable resolutions.
comment: This is a preprint of an article published in the Computer Graphics
Forum. The final authenticated version is available at
(https://doi.org/10.1111/cgf.70002). Please also check the project page:
https://nikoloside.github.io/deepfracture/
♻ ☆ On Memorization in Diffusion Models
Due to their capacity to generate novel and high-quality samples, diffusion
models have attracted significant research interest in recent years. Notably,
the typical training objective of diffusion models, i.e., denoising score
matching, has a closed-form optimal solution that can only generate samples
replicating the training data. This indicates that a memorization behavior is
theoretically expected, which contradicts the common generalization ability of
state-of-the-art diffusion models, and thus calls for a deeper understanding.
Looking into this, we first observe that memorization behaviors tend to occur
on smaller-sized datasets, which motivates our definition of effective model
memorization (EMM), a metric measuring the maximum size of training data at
which a learned diffusion model approximates its theoretical optimum. Then, we
quantify the impact of the influential factors on these memorization behaviors
in terms of EMM, focusing primarily on data distribution, model configuration,
and training procedure. Besides comprehensive empirical results identifying the
influential factors, we surprisingly find that conditioning training data on
uninformative random labels can significantly trigger the memorization in
diffusion models. Our study holds practical significance for diffusion model
users and offers clues to theoretical research in deep generative models. Code
is available at https://github.com/sail-sg/DiffMemorize.
comment: TMLR 2025
♻ ☆ SpinQuant: LLM quantization with learned rotations ICLR 2025
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort
Post-training quantization (PTQ) techniques applied to weights, activations,
and the KV cache greatly reduce memory usage, latency, and power consumption of
Large Language Models (LLMs), but may lead to large quantization errors when
outliers are present. Rotating activation or weight matrices helps remove
outliers and benefits quantization. In this work, we identify a collection of
applicable rotation parameterizations that lead to identical outputs in
full-precision Transformer architectures while enhancing quantization accuracy.
In addition, we find that some random rotations lead to much better
quantization than others, with up to a 13-point difference in downstream
zero-shot reasoning performance. As a result, we propose SpinQuant, a novel
approach that incorporates learned rotation matrices for optimal quantized
network accuracy. With 4-bit quantization of weight, activation, and KV-cache,
SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full
precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by
19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also
outperforms concurrent work QuaRot, which applies random rotations to remove
outliers. In particular, for LLaMA-3 8B models that are hard to quantize,
SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot.
Code is available at https://github.com/facebookresearch/SpinQuant.
comment: ICLR 2025
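The core observation above, that an orthogonal rotation can be folded into the weights without changing full-precision outputs while reshaping the distributions that get quantized, can be illustrated as follows. This is a generic rotate-then-quantize sketch with a random rotation and toy 4-bit quantizer, not SpinQuant's learned rotation parameterization.

import torch

def random_orthogonal(n):
    """Random orthogonal matrix via QR decomposition."""
    q, _ = torch.linalg.qr(torch.randn(n, n))
    return q

def quantize_int4(w):
    """Toy symmetric 4-bit per-tensor quantization."""
    scale = w.abs().max() / 7.0
    return torch.clamp((w / scale).round(), -8, 7) * scale

d = 64
W, x = torch.randn(128, d), torch.randn(4, d)
R = random_orthogonal(d)

# full-precision outputs are unchanged: (x R) (W R)^T = x W^T since R R^T = I
y_rotated = (x @ R) @ (W @ R).T
assert torch.allclose(y_rotated, x @ W.T, atol=1e-4)

# but quantization error can differ between the plain and rotated weights
err_plain = (quantize_int4(W) - W).pow(2).mean()
err_rot = (quantize_int4(W @ R) - (W @ R)).pow(2).mean()
print(float(err_plain), float(err_rot))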
♻ ☆ Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning
The burgeoning navigation services using digital maps provide great
convenience to drivers. Nevertheless, the presence of anomalies in lane
rendering map images occasionally introduces potential hazards, as such
anomalies can be misleading to human drivers and consequently contribute to
unsafe driving conditions. In response to this concern and to accurately and
effectively detect the anomalies, this paper transforms lane rendering image
anomaly detection into a classification problem and proposes a four-phase
pipeline consisting of data pre-processing, self-supervised pre-training with
the masked image modeling (MiM) method, customized fine-tuning using
cross-entropy based loss with label smoothing, and post-processing to tackle it
leveraging state-of-the-art deep learning techniques, especially those
involving Transformer models. Various experiments verify the effectiveness of
the proposed pipeline. Results indicate that the proposed pipeline exhibits
superior performance in lane rendering image anomaly detection, and notably,
the self-supervised pre-training with MiM can greatly enhance the detection
accuracy while significantly reducing the total training time. For instance,
employing the Swin Transformer with Uniform Masking as self-supervised
pretraining (Swin-Trans-UM) yielded a heightened accuracy of 94.77% and an
improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin
Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an
AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the
original 280. In conclusion, the proposed pipeline, with its incorporation of
self-supervised pre-training using MiM and other advanced deep learning
techniques, emerges as a robust solution for enhancing the accuracy and
efficiency of lane rendering image anomaly detection in digital navigation
systems.
comment: 26 pages, 7 figures, accepted by the 103rd Transportation Research
Board (TRB) Annual Meeting, under review by Transportation Research Record:
Journal of the Transportation Research Board
♻ ☆ MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, William Yang Wang
Scientific figure interpretation is a crucial capability for AI-driven
scientific assistants built on advanced Large Vision Language Models. However,
current datasets and benchmarks primarily focus on simple charts or other
relatively straightforward figures from limited science domains. To address
this gap, we present a comprehensive dataset compiled from peer-reviewed Nature
Communications articles covering 72 scientific fields, encompassing complex
visualizations such as schematic diagrams, microscopic images, and experimental
data which require graduate-level expertise to interpret. We evaluated 19
proprietary and open-source models on two benchmark tasks, figure captioning
and multiple-choice, and conducted human expert annotation. Our analysis
revealed significant task challenges and performance gaps among models. Beyond
serving as a benchmark, this dataset serves as a valuable resource for
large-scale training. Fine-tuning Qwen2-VL-7B with our task-specific data
achieved better performance than GPT-4o and even human experts in
multiple-choice evaluations. Furthermore, continuous pre-training on our
interleaved article and figure data substantially enhanced the model's
downstream task performance in materials science. We have released our dataset
to support further research.
comment: Code and data are available at https://github.com/Leezekun/MMSci
♻ ☆ Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? ICLR 2025
Egocentric video-language pretraining is a crucial step in advancing the
understanding of hand-object interactions in first-person scenarios. Despite
successes on existing testbeds, we find that current EgoVLMs can be easily
misled by simple modifications, such as changing the verbs or nouns in
interaction descriptions, with models struggling to distinguish between these
changes. This raises the question: Do EgoVLMs truly understand hand-object
interactions? To address this question, we introduce a benchmark called
EgoHOIBench, revealing the performance limitation of current egocentric models
when confronted with such challenges. We attribute this performance gap to
insufficient fine-grained supervision and the greater difficulty EgoVLMs
experience in recognizing verbs compared to nouns. To tackle these issues, we
propose a novel asymmetric contrastive objective named EgoNCE++. For the
video-to-text objective, we enhance text supervision by generating negative
captions using large language models or leveraging pretrained vocabulary for
HOI-related word substitutions. For the text-to-video objective, we focus on
preserving an object-centric feature space that clusters video representations
based on shared nouns. Extensive experiments demonstrate that EgoNCE++
significantly enhances EgoHOI understanding, leading to improved performance
across various EgoVLMs in tasks such as multi-instance retrieval, action
recognition, and temporal understanding. Our code is available at
https://github.com/xuboshen/EgoNCEpp.
comment: Accepted by ICLR 2025. Code: https://github.com/xuboshen/EgoNCEpp
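The video-to-text side of the objective described above hinges on hard negative captions built by swapping HOI verbs or nouns; a toy illustration follows, in which the verb vocabulary, substitution rule, and loss shape are my own simplifications rather than the EgoNCE++ implementation.

import random
import torch
import torch.nn.functional as F

VERBS = ["cut", "open", "pour", "hold", "wash"]   # toy HOI verb vocabulary

def make_negative_caption(caption, verb):
    """Swap the interaction verb for a different one to form a hard negative."""
    alternatives = [v for v in VERBS if v != verb]
    return caption.replace(verb, random.choice(alternatives))

def video_to_text_loss(video_emb, pos_text_emb, neg_text_embs, tau=0.07):
    """InfoNCE over one positive caption and its generated hard negatives.
    video_emb: (D,), pos_text_emb: (D,), neg_text_embs: (K, D), all L2-normalized."""
    texts = torch.cat([pos_text_emb.unsqueeze(0), neg_text_embs], dim=0)
    logits = texts @ video_emb / tau                 # (1 + K,)
    target = torch.zeros(1, dtype=torch.long)        # positive caption sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)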
♻ ☆ Pose Prior Learner: Unsupervised Categorical Prior Learning for Pose Estimation
A prior represents a set of beliefs or assumptions about a system, aiding
inference and decision-making. In this paper, we introduce the challenge of
unsupervised categorical prior learning in pose estimation, where AI models
learn a general pose prior for an object category from images in a
self-supervised manner. Although priors are effective in estimating pose,
acquiring them can be difficult. We propose a novel method, named Pose Prior
Learner (PPL), to learn a general pose prior for any object category. PPL uses
a hierarchical memory to store compositional parts of prototypical poses, from
which we distill a general pose prior. This prior improves pose estimation
accuracy through template transformation and image reconstruction. PPL learns
meaningful pose priors without any additional human annotations or
interventions, outperforming competitive baselines on both human and animal
pose estimation datasets. Notably, our experimental results reveal the
effectiveness of PPL using learned prototypical poses for pose estimation on
occluded images. Through iterative inference, PPL leverages the pose prior to
refine estimated poses, regressing them to any prototypical poses stored in
memory. Our code, model, and data will be publicly available.
♻ ☆ CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu
Customized generation has achieved significant progress in image synthesis,
yet personalized video generation remains challenging due to temporal
inconsistencies and quality degradation. In this paper, we introduce
CustomVideoX, an innovative framework leveraging the video diffusion
transformer for personalized video generation from a reference image.
CustomVideoX capitalizes on pre-trained video networks by exclusively training
the LoRA parameters to extract reference features, ensuring both efficiency and
adaptability. To facilitate seamless interaction between the reference image
and video content, we propose 3D Reference Attention, which enables direct and
simultaneous engagement of reference image features with all video frames
across spatial and temporal dimensions. To mitigate the excessive influence of
reference image features and textual guidance on generated video content during
inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy,
dynamically modulating reference bias over different time steps. Additionally,
we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly
activated regions of key entity tokens with reference feature injection by
adjusting attention bias. To thoroughly evaluate personalized video generation,
we establish a new benchmark, VideoBench, comprising over 50 objects and 100
prompts for extensive assessment. Experimental results show that CustomVideoX
significantly outperforms existing methods in terms of video consistency and
quality.
comment: Section 4 of CustomVideoX (Entity Region-Aware Enhancement) contains
description errors, and the compared-methods data in Table I lacks some metrics
♻ ☆ 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation
Hansheng Chen, Bokui Shen, Yulin Liu, Ruoxi Shi, Linqi Zhou, Connor Z. Lin, Jiayuan Gu, Hao Su, Gordon Wetzstein, Leonidas Guibas
Multi-view image diffusion models have significantly advanced open-domain 3D
object generation. However, most existing models rely on 2D network
architectures that lack inherent 3D biases, resulting in compromised geometric
consistency. To address this challenge, we introduce 3D-Adapter, a plug-in
module designed to infuse 3D geometry awareness into pretrained image diffusion
models. Central to our approach is the idea of 3D feedback augmentation: for
each denoising step in the sampling loop, 3D-Adapter decodes intermediate
multi-view features into a coherent 3D representation, then re-encodes the
rendered RGBD views to augment the pretrained base model through feature
addition. We study two variants of 3D-Adapter: a fast feed-forward version
based on Gaussian splatting and a versatile training-free version utilizing
neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter
not only greatly enhances the geometry quality of text-to-multi-view models
such as Instant3D and Zero123++, but also enables high-quality 3D generation
using the plain text-to-image Stable Diffusion. Furthermore, we showcase the
broad application potential of 3D-Adapter by presenting high quality results in
text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
comment: Project page: https://lakonik.github.io/3d-adapter/
♻ ☆ Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
Yuheng Ji, Yue Liu, Zhicheng Zhang, Zhao Zhang, Yuting Zhao, Xiaoshuai Hao, Gang Zhou, Xingwei Zhang, Xiaolong Zheng
Vision-Language Models (VLMs) play a crucial role in the advancement of
Artificial General Intelligence (AGI). As AGI rapidly evolves, addressing
security concerns has emerged as one of the most significant challenges for
VLMs. In this paper, we present extensive experiments that expose the
vulnerabilities of conventional adaptation methods for VLMs, highlighting
significant security risks. Moreover, as VLMs grow in size, the application of
traditional adversarial adaptation techniques incurs substantial computational
costs. To address these issues, we propose a parameter-efficient adversarial
adaptation method called \textbf{\textit{AdvLoRA}} based on Low-Rank
Adaptation. We investigate and reveal the inherent low-rank properties involved
in adversarial adaptation for VLMs. Different from LoRA, we enhance the
efficiency and robustness of adversarial adaptation by introducing a novel
reparameterization method that leverages parameter clustering and alignment.
Additionally, we propose an adaptive parameter update strategy to further
bolster robustness. These innovations enable our AdvLoRA to mitigate issues
related to model security and resource wastage. Extensive experiments confirm
the effectiveness and efficiency of AdvLoRA.
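The abstract above does not specify the clustering-and-alignment reparameterization, so the sketch below only illustrates the surrounding structure: a frozen base linear layer with trainable low-rank factors, where the down-projection is seeded from grouped rows of the pretrained weight as a crude stand-in for clustering. The class name, the grouping rule, and the init are assumptions, not AdvLoRA itself.

```python
# Minimal LoRA layer with a clustering-inspired initialization (illustrative only).
import torch
import torch.nn as nn

class ClusterInitLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # keep the pretrained weight frozen

        out_f, in_f = base.weight.shape             # assumes rank <= out_f
        # Crude stand-in for parameter clustering: randomly partition the rows of W
        # into `rank` groups and seed A with the group centroids, so each low-rank
        # direction starts aligned with a coherent block of pretrained parameters.
        with torch.no_grad():
            groups = torch.randperm(out_f).chunk(rank)
            centroids = torch.stack([base.weight[g].mean(dim=0) for g in groups])
        self.lora_A = nn.Parameter(centroids)                  # (rank, in_f)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))   # zero init: no drift at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T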
♻ ☆ OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering
Rendering dynamic 3D humans from monocular videos is crucial for applications
such as virtual reality and digital entertainment. Most methods assume the
person is in an unobstructed scene, whereas in real-life scenarios various
objects may occlude parts of the body. A previous approach uses NeRF-based
surface rendering to recover the occluded areas, but it requires more than a
day to train and several seconds to render, failing to meet the requirements
of real-time interactive applications. To address these issues, we propose
OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6
minutes and produces high-quality human renderings at up to 160 FPS from
occluded input. OccGaussian initializes 3D Gaussian distributions in the
canonical space; we then perform an occlusion feature query at occluded
regions and extract aggregated pixel-aligned features to compensate for the
missing information. A Gaussian Feature MLP further processes these features,
together with occlusion-aware loss functions, to better perceive the occluded
area. Extensive experiments on both simulated and real-world occlusions
demonstrate that our method achieves comparable or even superior performance
compared to the state-of-the-art method, while improving training and
inference speeds by 250x and 800x, respectively. Our code will be available
for research purposes.
comment: We have decided to withdraw this paper because the results require
further verification or additional experimental data. We plan to resubmit an
updated version once the necessary work is completed
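The occlusion feature query and Gaussian Feature MLP mentioned in the abstract above could plausibly look like the sketch below: for Gaussians projected into occluded regions, aggregate features from nearby visible Gaussians and refine them with a small MLP. The tensor layouts, the k-nearest-neighbour aggregation rule, and all names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical occlusion feature query: distance-weighted aggregation from
# visible Gaussians, followed by an MLP refinement for occluded ones.
import torch
import torch.nn as nn

class GaussianFeatureMLP(nn.Module):
    def __init__(self, feat_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, gauss_xyz, gauss_feat, occluded_mask, k: int = 8):
        """gauss_xyz: (N, 3) canonical positions; gauss_feat: (N, C) per-Gaussian
        features; occluded_mask: (N,) bool for Gaussians in occluded regions."""
        vis_xyz, vis_feat = gauss_xyz[~occluded_mask], gauss_feat[~occluded_mask]
        occ_xyz = gauss_xyz[occluded_mask]

        # Occlusion feature query: softmax-weighted average over the k nearest
        # visible Gaussians (one plausible aggregation rule; assumes k <= N_vis).
        d = torch.cdist(occ_xyz, vis_xyz)                      # (N_occ, N_vis)
        knn_d, knn_idx = d.topk(k, dim=1, largest=False)
        w = torch.softmax(-knn_d, dim=1)                       # closer => larger weight
        agg = (w.unsqueeze(-1) * vis_feat[knn_idx]).sum(dim=1) # (N_occ, C)

        # Refine the aggregated feature and write it back for occluded Gaussians.
        out = gauss_feat.clone()
        out[occluded_mask] = gauss_feat[occluded_mask] + self.mlp(agg)
        return out
```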
♻ ☆ SemiHMER: Semi-supervised Handwritten Mathematical Expression Recognition using pseudo-labels
In this paper, we study semi-supervised Handwritten Mathematical Expression
Recognition (HMER) via exploring both labeled data and extra unlabeled data. We
propose a novel consistency regularization framework, termed SemiHMER, which
introduces dual-branch semi-supervised learning. Specifically, we enforce
consistency between the two networks for the same input image. The
pseudo-label, generated by one perturbed recognition network, is utilized to
supervise the other network using the standard cross-entropy loss. The SemiHMER
consistency encourages high similarity between the predictions of the two
perturbed networks for the same input image and expands the training data by
leveraging unlabeled data with pseudo-labels. We further introduce a
weak-to-strong strategy by applying different levels of augmentation to each
branch, effectively expanding the training data and enhancing the quality of
network training. Additionally, we propose a novel module, the Global Dynamic
Counting Module (GDCM), to enhance the performance of the HMER decoder by
alleviating recognition inaccuracies in long-distance formula recognition and
reducing the occurrence of repeated characters. The experimental results
demonstrate that our work achieves significant performance improvements, with
an average accuracy increase of 5.47% on CROHME14, 4.87% on CROHME16, and 5.25%
on CROHME19, compared to our baselines.
comment: 17 pages, 3 figures
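The dual-branch pseudo-label consistency described in the abstract above can be summarized with the sketch below: each network sees a differently augmented view of the same unlabeled image, and each branch's confident predictions supervise the other branch with cross-entropy. The confidence threshold, the per-token logit layout, and the model/augmentation interfaces are assumptions, not the paper's exact training recipe.

```python
# Minimal sketch of dual-branch, weak-to-strong pseudo-label consistency.
import torch
import torch.nn.functional as F

def semi_supervised_step(net_a, net_b, x_unlabeled, weak_aug, strong_aug, tau=0.9):
    # Weak-to-strong: one branch sees a weak view, the other a strong view.
    logits_a = net_a(weak_aug(x_unlabeled))       # (B, T, V) per-token logits
    logits_b = net_b(strong_aug(x_unlabeled))

    # Pseudo-labels from each (detached) branch supervise the other branch.
    with torch.no_grad():
        conf_a, pseudo_a = logits_a.softmax(-1).max(-1)   # (B, T)
        conf_b, pseudo_b = logits_b.softmax(-1).max(-1)

    loss_b = F.cross_entropy(logits_b.flatten(0, 1), pseudo_a.flatten(),
                             reduction="none")
    loss_a = F.cross_entropy(logits_a.flatten(0, 1), pseudo_b.flatten(),
                             reduction="none")

    # Keep only confident pseudo-labels (a common filtering choice, assumed here).
    mask_a = (conf_a > tau).flatten().float()
    mask_b = (conf_b > tau).flatten().float()
    return (loss_b * mask_a).mean() + (loss_a * mask_b).mean()
```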